The Motor Organization of Some Speech Gestures*

Fredericka Bell-Berti+ and Katharine Safford Harris++
Haskins Laboratories, New Haven, Conn.



A body of speech research has been concerned with describing the phenomena 
and postulating the rules for the reorganization of speech gestures. We have 
inspected electromyographic (EMG) data from three speakers of American English 
and one speaker of Swedish in an effort to discover the manifestations of, and 
limits on, reorganization at the motor command level. We have defined this re- 
organization as the merging of EMG activity peaks for sequences of speech sounds. 
Our EMG data were recorded from several extrinsic lingual muscles — the mylohyoid, 
genioglossus, and palatoglossus — and were processed using the Haskins Laborator- 
ies' EMG data system. We have called the muscles we studied "closers" of the 
vocal tract, because they are active for articulations requiring tongue raising. 
The experimental utterances were polysyllabic nonsense words. The muscles 
studied were active for both vowels and consonants included in the experimental 
utterances. Stop and fricative consonants, having inherently different degrees 
of vocal tract closure, were included in the utterances. The figures that follow 
show EMG signals for short sequences of speech sounds occurring within longer 
strings. 

The vertical line at time zero represents the point in the acoustic signal 
that was used to line up the individual tokens for averaging. Representative 
acoustic signals appear above each graph, and are aligned with the zero reference 
point. 

Results 

The palatoglossus muscle is active for subject BG for /u/ and /m/ (Figure 1).
The EMG curves for the /-um-/ and /-mu-/ sequences of the utterances are
different: there is one peak for the /-um-/ sequence of /fumpup/, while there are
two separate peaks for the /-mu-/ sequence of /fupmup/.

The mylohyoid muscle is active for subjects LJR and KSH for /i/, /s/, and
/k/ (Figure 2). One peak occurs for the /-ik-/ sequence of /a'pikupa/ (Figure 2a)
and two peaks occur for the /-ki-/ sequence of /o'pukipo/ (Figure 2b). In the
lower figure we see one peak of activity for the /-sk-/ sequence of /apul'skupa/
and two peaks of activity for the /-ks-/ sequence of /epulk'supa/.

*Paper presented at the 86th meeting of the Acoustical Society of America,
Los Angeles, Calif., November 1973.

+Also Montclair State College, Upper Montclair, N. J.

++Also City College of the City University of New York.

The genioglossus muscle is active for subjects LJR, KSH, and FBB in these 
utterances for /i/ and /k/ (Figure 3). The primary syllable stress varies in 
these utterances between the second and third syllable. In each case, there is 
one peak of activity for the /-ik-/ sequence and there are two peaks of activity 
for the /-ki-/ sequence. 

Discussion 

We may consider the reorganization of motor commands to be coarticulatory
events. One type of reorganization may be viewed as anticipatory, where a future
gesture influences the present one. We wish to discover when the commands for a
gesture will merge with commands for a future gesture (anticipatory
coarticulation at the motor command level).

We find in all these data that there is one peak of activity for a sequence
of speech sounds beginning with a more open vocal tract and ending with a less 
open vocal tract, and there are two peaks of activity for a sequence of speech 
sounds beginning with a less open vocal tract and ending with a more open vocal 
tract, regardless of the position of the perceived syllable boundary or primary 
syllable stress. [The finding of coarticulation occurring across syllable 
boundaries has previously been reported by Daniloff and Moll (1968), although 
they were discussing a different type of phenomenon.] 

All of our data to this point were collected from muscles that we have 
called vocal tract closers because their activity increases tongue height (or, 
for the palatoglossus, for nasal gestures, brings the velum down toward the 
tongue). In each case, the motor commands for sequences of speech sounds were 
reorganized (or merged) when the sequence began with a relatively more open vocal 
tract and ended with a more closed vocal tract, /-ik-/, /-sk-/, /-um-/.
Reorganization was not observed when the sequence began with a more closed vocal
tract and terminated with a more open vocal tract, /-ki-/, /-ks-/, /-mu-/.
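
As an illustration of the one-peak versus two-peak distinction, the following is a minimal sketch of how peaks might be counted in an averaged EMG curve. It is not the procedure of the Haskins EMG data system; the function name, the activity threshold, and the minimum-dip criterion are all assumptions introduced here. Two maxima are treated as separate peaks only if the curve falls by at least min_dip between them; otherwise they are treated as merged.

```python
def count_peaks(envelope, threshold, min_dip):
    """Count activity peaks in a smoothed EMG envelope (hypothetical helper).

    A maximum is counted when it reaches `threshold` and the curve then
    falls by at least `min_dip`. A later maximum is counted separately only
    after the curve has risen again by at least `min_dip` from the valley.
    """
    if not envelope:
        return 0
    peaks = 0
    rising = True            # True while tracking a running maximum
    extreme = envelope[0]    # running maximum (rising) or minimum (falling)
    for value in envelope[1:]:
        if rising:
            if value > extreme:
                extreme = value
            elif extreme - value >= min_dip:
                if extreme >= threshold:
                    peaks += 1
                rising = False
                extreme = value
        else:
            if value < extreme:
                extreme = value
            elif value - extreme >= min_dip:
                rising = True
                extreme = value
    # A final maximum that is not followed by a dip still counts.
    if rising and extreme >= threshold:
        peaks += 1
    return peaks


# Invented curves (arbitrary units), not measured data.
merged = [0, 2, 6, 9, 8, 9, 7, 3, 0]     # shallow dip: one merged peak
separate = [0, 3, 8, 3, 1, 4, 9, 4, 0]   # deep valley: two peaks
print(count_peaks(merged, threshold=4, min_dip=3))
print(count_peaks(separate, threshold=4, min_dip=3))
```

The choice of min_dip determines how readily two maxima are called "merged," so any such count depends on an explicit, stated criterion.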

We hypothesize, then, that muscles that are vocal tract closers coarticulate 
only to the greatest constriction. We presume, by analogy, that muscles that are 
vocal tract openers coarticulate to the greatest opening. Viewing any speech 
sequence, closers exhibit anticipatory coarticulation when the vocal tract is 
being closed and openers exhibit anticipatory coarticulation when the vocal tract 
is being opened. Only one set of these muscles, either the openers or the 
closers, may anticipate future gestures within any sequence.

Summary 

In summary, we hypothesize that muscles that close the vocal tract coarticu- 
late only to the greatest constriction. And we presume, by analogy, that muscles 
that open the vocal tract coarticulate only to the greatest opening. 

REFERENCE 

Daniloff, Raymond G. and Kenneth Moll. (1968) Coarticulation of lip rounding. 
J. Speech Hearing Res. 11, 707-721. 
Effect of Speaking Rate on Labial Consonant-Vowel Articulation*

Thomas Gay,+ Tatsujiro Ushijima, Hajime Hirose, and Franklin S. Cooper

Haskins Laboratories, New Haven, Conn.



ABSTRACT 

The purpose of this experiment was to study the effect of speaking
rate on the articulation of the consonants /p/ and /w/ in combination
with the vowels /i/, /a/, and /u/. Two subjects read a list of
nonsense syllables containing /p/ and /w/ in all possible
vowel-consonant-vowel combinations with /i/, /a/, and /u/ at both moderate
and fast speaking rates. Electromyographic recordings from muscles that
control movements of the lips, tongue, and jaw were made simultaneously
with high-speed, lateral-view X-ray films of the tongue and
jaw, and with high-speed, full-face motion pictures of the lips. For
labial consonant production, an increase in speaking rate is accompanied
by an increase in the activity level of the muscle (orbicularis
oris) and slightly faster rates of lip movement (both closing and
opening). Vowel production, however, shows opposite effects: an increase
in speaking rate is accompanied by a decrease in the activity
level of the genioglossus muscle and, as shown by the X-ray films,
evidence of target undershoot. Jaw movement data show more variable,
context-dependent effects of speaking rate. Observed differences are
explained in terms of the muscle systems involved.



*Expanded version of a paper presented at the 86th meeting of the Acoustical
Society of America, Los Angeles, Calif., November 1973.

+Also Department of Oral Biology, University of Connecticut Health Center,
Farmington.

Faculty of Medicine, University of Tokyo; visiting researcher, Haskins
Laboratories, 1970-1972.

On leave from the Faculty of Medicine, University of Tokyo.

Acknowledgment: The authors wish to thank Dr. J. Daniel Subtelny, head of the
Department of Orthodontics, Eastman Dental Center, Rochester, N. Y., for the
use of his cinefluorographic facilities; and Mrs. Constance Christian for her
help in analyzing the motion picture data.
INTRODUCTION



The way a speaker produces a given string of phones will show a good deal of 
variability depending upon, among other things, the suprasegmental features of
stress and speaking rate. The control of speaking rate is a good illustration of
the complex nature of these allophonic variations. For example, it is commonly 
known that during faster speech a vowel tends to change color toward the neutral 
schwa (Lindblom, 1963, 1964). Lindblom's original model proposes that this 
neutralization is a consequence of the shorter duration of the vowel and is 
caused by a temporal overlap of motor commands to the articulators. In other 
words, the articulators fail to reach, or undershoot, their targets because the 
next set of motor commands deflects them to the following target before the first 
target is reached. This phenomenon implies further that both the rate of move- 
ment of the articulators (specifically the tongue) and the activity levels of the 
muscles either remain unchanged or are decreased during faster speech. Although 
similar undershoot effects have been observed for other phones (Gay, 1968; Kent, 
1970), a general model of speaking rate control based on timing changes alone is 
too simple. For example, in a recent study of labial consonant production (Gay 
and Hirose, 1973) it was shown that an increase in speaking rate is accompanied 
by both an increase in the rate of movement of the lips and an increase in the 
activity levels of the muscles. Although changes in the timing of commands to 
the muscles do occur, the production of labial consonants during faster speech is 
characterized primarily by an increase in articulatory effort. 

The implication that more than one mechanism operates to control speaking
rate might be expected. Whereas vowel production involves a movement toward a
spatial target, the production of most consonants involves a movement towards
constrictive or occlusal targets. Thus, in a strict sense, the concept of
undershoot itself cannot be easily applied to consonant production. Of course, too,
the phenomena described above are based on a number of different experiments; it
is quite conceivable that some of the differences observed are individual ones.

This paper represents an attempt to describe these phenomena further by 
studying the effect of speaking rate on the articulation of labial consonants on 
both preceding and following vowels. The specific purpose of the experiment was 
to study the effect of speaking rate on the coordination of lip, tongue, and jaw 
movements during the production of the labial consonants /p/ and /w/ in
combination with the vowels /i/, /a/, and /u/. The experiment utilized the combined
techniques of electromyography (EMG); cinefluorography; and direct, high-speed
motion picture photography.

METHOD 

Subjects and Speech Material 

Speakers were two adult males, both native speakers of American English. 
The speech material consisted of the consonants /p/ and /w/ and the vowels /i/, 
/a/, and /u/ in a trisyllable nonsense word of the form /k V1 C V2 p/ where V1
and V2 were all possible combinations of /i/, /a/, and /u/ and C was either /p/ 
or /w/. An additional set of trisyllables of the form /kut V pa/ (V = /i/, /a/, 
/u/) was also constructed. These stimuli were incorporated into the EMG part of 
the experiment to obtain lip-rounding data for /u/ in a nonlabial consonant 
environment. The utterance types were randomized into a master list. The carrier
phrase, "It's a...," preceded each utterance. Two speaking rates were studied:
slow (normal) and fast. Each speaking rate was based on the subject's own 
appraisal of comfortable slow and fast rates. A brief practice session preceded 
each run. The subjects were also instructed to produce the first two syllables 
with equal stress and the final syllable unstressed. The subjects' performances 
were monitored continuously throughout the run. 

Electromyography 

For both subjects, conventional hooked-wire electrodes were inserted into 
muscles that control movements of the lips, tongue, and jaw. These muscles are 
listed in Table 1. Although all muscle locations showed adequate firing levels 
at the time of electrode insertion, some locations deteriorated at one time or 
another during the run. The extent to which this occurred is also indicated in 
Table 1. 



TABLE 1: EMG electrode locations.

Subject FSC                         Subject TG

Orbicularis Oris (OO)               Orbicularis Oris (OO)
Genioglossus (GG)                   Genioglossus (GG)(2)
Internal Pterygoid (IP)(1)          Superior Longitudinal (SL)(2)
Anterior Belly Digastric (AD)       Internal Pterygoid (IP)(2)
                                    Anterior Belly Digastric (AD)

(1) Analyzed for combined run only.
(2) Not usable.



The basic procedure was to collect EMG data for a number of tokens of a
given utterance and, using a digital computer, average the integrated EMG signals
at each electrode position. The EMG data were recorded on a 14-channel
instrumentation tape recorder together with the acoustic signal and a sequence of
digital code pulses (octal format). These pulses are used to identify each utterance
for the computer during processing. A more detailed description of the various
aspects of the experimental procedure can be found elsewhere (Hirose, 1971;
Kewley-Port, 1971, 1973).
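
The averaging step can be pictured with a minimal sketch, given here only as an illustration and not as the processing actually used. It assumes that each token is already an integrated (rectified and smoothed) EMG envelope sampled at a common rate, and that a line-up sample index is known for each token; the function name, its arguments, and the sample values are hypothetical.

```python
def average_tokens(tokens, ref_indices, before, after):
    """Average envelopes after aligning each token at its reference sample.

    tokens      -- list of sampled envelopes (lists of numbers)
    ref_indices -- line-up sample index for each token
    before      -- number of samples kept before the reference sample
    after       -- number of samples kept after the reference sample
    """
    if not tokens or len(tokens) != len(ref_indices):
        raise ValueError("need at least one token and one reference index per token")
    window = before + after + 1
    sums = [0.0] * window
    for signal, ref in zip(tokens, ref_indices):
        start = ref - before
        stop = ref + after + 1
        if start < 0 or stop > len(signal):
            raise ValueError("token too short for the requested window")
        for i, value in enumerate(signal[start:stop]):
            sums[i] += value
    return [total / len(tokens) for total in sums]


# Two invented tokens whose line-up points fall at different samples.
tokens = [[0, 1, 5, 1, 0], [1, 7, 1, 0, 0]]
print(average_tokens(tokens, ref_indices=[2, 1], before=1, after=1))
```

Because each token is shifted so that its line-up point falls on the same sample, activity that is time-locked to that point is preserved in the average, while activity that varies in timing from token to token is smoothed.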



Cinefluorography

Lateral-view X-ray films were recorded with a 16 mm cine camera set to run
at 64 fps. The X-ray generator delivered 1 msec pulses to a 9 inch image
intensifier tube. The subject was seated with his head positioned in a standard
headholder. A barium sulfate paste was used as a contrast medium on the tongue, and
tantalum was applied along the midline of the nose, lips, and jaw to outline
those structures. The X-ray film records were synchronized with the other records
by a pulse train generated by the camera and recorded on the data tape.
High-Speed Motion Picture Photography

High-speed motion pictures of lip movements were recorded with a 16 mm
Milliken camera set to run at 128 fps. Because of space constraints the
full-face motion pictures of the lips were recorded through a mirror. The motion
picture and EMG data were synchronized by an annotation system that displayed the
octal code pulses on an LED device placed in the path of the camera. This
display was also driven by a signal from the camera to count individual frames
between octal codes. Before the run, white reference dots were painted on the
subjects' lips at the midline. A scale was fixed to the mirror for calibration of
lip movement measurements. A block diagram of the recording system is shown in
Figure 1.

Data Recording and Analysis 

The combined EMG/cinefluorographic/high-speed motion picture data were
recorded at the beginning of the run, after which the EMG part of the experiment
continued. For the second segment of the run (EMG only), the word list was
repeated ten times at each of the two speaking rates. The EMG data from the
combined-techniques part of the run were processed separately from the remainder of
the run. This allowed comparisons to be made between the individual (combined
run) tokens and the averaged data.

The X-ray films were analyzed by frame-by-frame tracings to obtain the
outline of the surface of the tongue as well as a direct measurement of jaw
displacement (vertical distance between the incisors). The direct-view motion
picture films were analyzed by frame-by-frame measurements of vertical lip
opening at the midline. All film measurements were made on a Perceptoscope film
analyzer.
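
The time resolution of the frame-by-frame measurements follows directly from the stated camera speeds (64 fps for the X-ray films, 128 fps for the lip films). The short sketch below works out the arithmetic; the function names are introduced here for illustration only.

```python
def frame_interval_ms(fps):
    """Time between successive film frames, in milliseconds."""
    return 1000.0 / fps


def frames_to_ms(n_frames, fps):
    """Elapsed time spanned by n_frames frame intervals, in milliseconds."""
    return n_frames * 1000.0 / fps


print(frame_interval_ms(64))    # X-ray films: 15.625 msec per frame
print(frame_interval_ms(128))   # lip films: 7.8125 msec per frame
print(frames_to_ms(8, 128))     # 8 frame intervals at 128 fps: 62.5 msec
```

On these figures, an event timed from the lip films is resolved to within roughly 8 msec, and one timed from the X-ray films to within roughly 16 msec.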

Duration measurements were made from the Visicorder tracings. The mean
durations of the utterances (token + carrier) were 980 msec and 650 msec for
Subject FSC, and 1,030 msec and 670 msec for Subject TG, for the slow and fast
speaking rates, respectively.
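
The size of the rate change implied by these mean durations can be worked out as follows. This is simple arithmetic on the four reported values; the dictionary layout and function names are assumptions made for the illustration.

```python
# Mean utterance durations (msec) as reported in the text.
MEAN_DURATION_MS = {
    "FSC": {"slow": 980, "fast": 650},
    "TG": {"slow": 1030, "fast": 670},
}


def rate_ratio(slow_ms, fast_ms):
    """How many times faster the fast rate is, judged by total duration."""
    return slow_ms / fast_ms


def percent_shortening(slow_ms, fast_ms):
    """Reduction in utterance duration from slow to fast, in percent."""
    return (slow_ms - fast_ms) / slow_ms * 100.0


for subject, d in MEAN_DURATION_MS.items():
    ratio = rate_ratio(d["slow"], d["fast"])
    shortening = percent_shortening(d["slow"], d["fast"])
    print(f"{subject}: ratio {ratio:.2f}, shortening {shortening:.1f} percent")
```

For both subjects the fast utterances are about one-third shorter than the slow ones, i.e., the fast rate is roughly one and a half times the slow rate.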

RESULTS 

Lip Movement 

Results of the electromyographic analyses for both subjects are summarized 
in Table 2. This table shows the peak activity levels of the orbicularis oris 
muscle for all utterances at both speaking rates. The orbicularis oris muscle is 
largely responsible for a closing gesture of the lips and its activity, as shown
here, is associated with lip closure during the production of the consonant. In
Table 2, C1 represents the first consonant (either /p/ or /w/) and C2 represents
the final /p/ in the utterance. 

Generally speaking, the data summarized in this table show that the peak
muscle activity levels of the orbicularis oris are greater during the fast speaking
rate condition than during the slow speaking rate condition. These differences,
with only a few exceptions, hold for both C1 and C2 and for the single tokens
(from the combined run) as well as the averaged tokens. The magnitude of these
increases, however, varies from nil to over 100 percent; the only consistent
trend is for C1 differences to be greater than C2 differences. This, of course,
is a probable stress effect. On the whole, these data, like those from a
previous study (Gay and Hirose, 1973), demonstrate that the major effect of an
increase in speaking rate for labial consonant production is an increase in
articulatory effort.

TABLE 2: Averaged and single token (in parentheses) peak EMG values (µV) for the
orbicularis oris muscle. Values for the slow speaking rate are in the
left column and values for the fast speaking rate are in the right
column of each cell. An asterisk (*) indicates higher values for the slow
speaking rate condition. [?] marks a value not recoverable from the source.

          Subject FSC                     Subject TG
          C1              C2              C1              C2
          S-F             S-F             S-F             S-F

ipip      120-185         130-140         205-280         200-210
          (110-195)       (90-110)        (220-270)       (180-240)

ipap      150-205         125-165         225-325         240-300
          (115-120)       (80-90)         (210-285)       (225-290)

ipup      175-215         135-150         215-345         185-345
          (150-175)       (145-170)       (190-305)       (165-315)

apip      140-220         140-145         235-270         215-225
          (130-145)       (80-85)         (235-265)       (220-235)

apap      145-225         130-155         220-270         245-260
          (120-125)       (120-190)       (215-295)       (240-255)

apup      170-205         145-145         220-270         175-255
          (140-150)       (120-200)       (205-310)       (175-250)

upip      130-185         130-145         135-290         210-260
          (120-180)       (100-160)       (155-240)       (200-250)

upap      150-240         120-150         155-245         100-210
          (95-200)        (90-95)         (145-230)       (105-195)

upup      110-265         140-170         165-245         180-195
          (115-205)       (95-95)         (145-250)       (180-195)

iwip      150-230         130-170         175-325         225-235
          (115-175)       (115-95)*       (180-300)       (230-230)

iwap      150-240         130-165         [?]             250-215*
          (150-160)       (95-95)         (175-295)       (240-230)*

iwup      200-275         165-170         [?]             [?]
          (130-220)       (95-110)        (160-295)       (190-290)

awip      155-195         125-155         [?]             [?]
          (130-225)       (95-70)*        (140-240)       (235-215)*

awap      1/3-205         125-155         135-240         230-190*
          (165-175)       (120-85)*       (120-235)       (235-225)*

awup      190-220         165-165         155-275         225-255
          (160-200)       (155-155)       (165-230)       (225-240)

uwip      155-190         125-140         105-205         215-205*
          (120-240)       (95-105)        (95-180)        (205-190)*

uwap      150-245         125-170         100-235         240-215*
          (150-265)       (105-90)*       (115-220)       (255-200)*

uwup      160-180         140-150         85-240          190-270
          (160-220)       (95-85)*        (110-205)       (200-250)
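
The range quoted above ("from nil to over 100 percent") can be checked against individual cells of Table 2. The following sketch converts a slow-fast cell into a percent change; the function name and the string handling are assumptions made for this illustration, and the three cells used are taken from the table.

```python
def percent_increase(cell):
    """Percent change from the slow to the fast value of a 'slow-fast' cell.

    Accepts the table's notation, including parentheses for single tokens
    and a trailing asterisk, e.g. '120-185' or '(115-95)*'.
    """
    slow, fast = (float(part) for part in cell.strip("()*").split("-"))
    return (fast - slow) / slow * 100.0


# Cells from Table 2 (averaged values).
print(round(percent_increase("120-185"), 1))  # FSC, ipip, C1
print(round(percent_increase("145-145"), 1))  # FSC, apup, C2
print(round(percent_increase("85-240"), 1))   # TG, uwup, C1
```

A negative result corresponds to a cell marked with an asterisk, where the slow value exceeds the fast value.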

A compatible result was obtained from the frame-by-frame analysis of the 
direct-view, high-speed motion pictures. Figure 2 shows the vertical lip open- 
ing measurements for the /a/ series of utterances. These data show that the 
rates of lip closing and opening for the consonant are slightly faster for the 
faster speaking rate condition. These effects occur for both C1 and C2 and for
all vowels, with the exception of the /upu/ and /uwu/ series for Subject TG. The 
increase in articulatory speed for the consonant is also carried over to the 
adjacent vowel as greater lip opening. Lip closure duration is somewhat variable 
across changes in speaking rate (Table 3), although in most cases it decreases 
with an increase in speaking rate. 
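
A rate of lip movement of the kind compared here can be estimated from successive frame-by-frame lip opening measurements by finite differences. The sketch below is an illustration only: the function name is hypothetical, the measurements are invented, and millimeters are assumed as the unit of lip opening (the text does not state one). The 128 fps frame rate is the one given for the lip films.

```python
def movement_rates(openings_mm, fps=128.0):
    """Frame-to-frame rate of change of lip opening, in mm per second.

    Positive values indicate opening and negative values closing.
    """
    return [(later - earlier) * fps
            for earlier, later in zip(openings_mm, openings_mm[1:])]


# Invented measurements: lips close, stay closed for one frame, reopen.
openings = [4.0, 2.0, 0.0, 0.0, 1.5, 3.5]
print(movement_rates(openings))
```

With measurements taken every 1/128 sec, a difference of 1 mm between adjacent frames corresponds to 128 mm/sec, so small measurement errors produce large changes in the estimated rate.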



TABLE 3: Lip closure durations (msec) for p1, rounded to the nearest 5 msec.

          Subject FSC             Subject TG
          Slow        Fast        Slow        Fast

ipi       80          60          80          60
ipa       100         100         100         80
ipu       90          60          100         70
api       100         70          60          60
apa       70          70          80          80
apu       110         80          80          70
upi       120         80          110         70
upa       110         90          100         80
upu       80          70          90          60

The effects that occur for /p/ also occur for /w/, i.e., an increase in
speaking rate is accompanied by both an increase in the activity level of the
muscle and an increase in the speed of movement of the lips. Further, the
target configuration of the lips (minimum lip opening) remains constant across both
speaking rates. In other words, the lips do not undershoot the /w/ target during
faster speech. An example of the /w/ curves is shown in Figure 3.

The instances where an increase in muscle activity level was associated with the
slower speaking rate condition occurred for 10 utterances out of a total of 144.
Reversals of the expected result were generally small and occurred only for the
final /p/ in the /w/ utterances.

The EMG data also show both contextual and individual differences in labial
consonant production. For those utterances where both consonants are /p/, the
EMG data for Subject FSC show higher peaks for p1 than p2; however, for Subject
TG, the opposite is usually the case: p2 peaks are usually (although not always)
higher than p1 peaks. For both subjects, muscle activity differences are
conditioned more by speaking rate than by position. In other words, the differences
in muscle activity levels as a function of speaking rate are greater than
differences in muscle activity levels associated with position (C1 vs C2). Further,
these data do not show any consistent vowel effects on muscle activity levels for 
consonant closure, although displacement differences were evident in the films 
(lip opening was greatest for /a/, less for /i/, and least for /u/). This, of 
course, might be attributed to a trade-off between displacement (/a/ and /i/) and 
degree of rounding (/u/), or to a trade-off between jaw opening and lip opening. 

In summary, both the EMG and motion picture data show that the major effect 
of an increase in speaking rate on the production of a labial consonant is an 
increase in articulatory effort and a corresponding increase in the speed of ar- 
ticulatory movement. Both effects imply a reorganization of the commands to the 
articulators as well as a change in the timing of those commands. 

Tongue Movement 

The EMG data for the genioglossus muscle of Subject FSC are summarized in
Table 4. The genioglossus muscle, which makes up the bulk of the tongue body,
is responsible for both protruding and bunching movements of the tongue. The
data in this table show that the activity levels of the genioglossus muscle
decrease during faster speech. This decrease occurs for all utterances (except for
/a/, where the genioglossus muscle shows only resting potentials). The magnitude
of these differences is clearly vowel dependent, with the greatest differences
occurring for /i/, less for /u/, and, of course, none for /a/. The activity
patterns of the muscle at each speaking rate show large and consistent vowel 
effects in the same directions. This latter finding is a rather common pattern 
that has been shown before (Smith, 1970; Harris, 1971). Both the vowel and 
speaking rate effects occur systematically and hold up across /p/ and /w/ as well 
as across both the first and second vowels. 

The X-ray films clearly reflect the decrease in muscle activity during
faster speech. Figure 4 shows the position of the tongue (Subject FSC) at the
V1 target, the time of lip closure for /p/, and the V2 target for the /a/ series of
utterances at both speaking rates. It is evident from these tracings that the
tongue does not extend as far for the vowel during fast speech as it does during
slow speech; it clearly undershoots its target. Tongue undershoot occurs
consistently for both subjects (although with a greater magnitude for FSC) and for both
V1 and V2. This shortened course of movement is also reflected in the position
of the tongue at the time of /p/ closure. During fast speech the tongue is in a
lagging (or more neutral) position compared to the tongue during slow speech.
These positional relationships are most obvious for the /i/-/a/ contrasts,
apparently because of their more directly opposite target positions.

This electrode location was unusable for Subject TG.

The genioglossus muscle is classically divided into two muscle groups. The
anterior fibers fan from the mandibular tubercle to subsurface points along the
length of the tongue. The posterior fibers run longitudinally from the mandibular
tubercle to the hyoid bone. Depending upon the precise location of the
recording electrodes, different response patterns can be observed for different
features. Based on the specific patterns of activity observed for this location,
it is assumed that the electrodes were placed in the posterior or lower anterior
fibers of the muscle.

TABLE 4: Averaged and single token (in parentheses) peak EMG values (µV) for the
genioglossus muscle for Subject FSC. Values for the slow speaking
rate condition are in the left column and values for the fast speaking
rate condition are in the right column of each cell.

          Subject FSC
          V1              V2
          S-F             S-F

ipip      355-205         335-275
          (450-405)       (490-370)

ipap      325-105         -
          (440-195)

ipup      360-180         155-125
          (495-285)       (310-200)

apip      -               350-265
                          (495-480)

apap      -               -

apup      -               160-130
                          (220-210)

upip      230-125         315-300
          (275-180)       (425-310)

upap      185-105         -
          (240-3 30)

upup      195-105         165-165
          (205-150)       (200-195)

iwip      295-180         305-285
          (380-340)       (370-360)

iwap      295-110         -
          (360-210)

iwup      305-175         165-145
          (410-390)       (265-230)

awip      -               310-300
                          (490-350)

awap      -               -

awup      -               165-165
                          (420-400)

uwip      130-90          350-315
          (195-175)       (485-460)

uwap      140-110         -
          (160-150)

uwup      125-85          200-150
          (175-145)       (245-190)

Although it is apparent that during faster speech the tongue follows a 
shorter, more restricted course from vowel to vowel, the film data could not be 
used to quantify articulatory rates of movement because of the measurement tech- 
nique used. However, coupled with the EMG data for Subject FSC, the film data 
suggest that articulatory rates of movement might very well remain constant 
across changes in speaking rate. This assumes, of course, that the decrease in 
EMG activity reflects only a decrease in articulator displacement and not a con- 
current decrease in articulator velocity. This suggestion is in agreement with 
the findings of both Lindblom (1964) and Kent (1970), whose data show little 
effect of speaking rate on articulatory velocity. However, such a statement is 
not necessarily universal. For example, Kuehn (1973) has shown that different 
speakers seem to use different strategies in the control of speaking rate. His 
data show that some speakers increase speaking rate by increasing articulatory 
velocity (with a corresponding decrease in amount of undershoot), and others, by 
decreasing articulatory displacement (with a corresponding decrease in
articulatory velocity).

In this experiment, the data for tongue movement differ markedly from those 
for lip movement. Whereas the tongue shows a decrease in muscle activity and 
target undershoot during faster speech, lip movement is characterized by an in- 
crease in muscle activity and an increase in articulatory speed. The question 
then arises as to whether such differences are phoneme related or muscle system 
related. The EMG data on lip rounding for /u/ bear directly on this question. 
Table 5 shows the averaged peak muscle activity levels of the orbicularis oris 
during lip rounding for /u/. These figures show that, for both subjects and all
utterances, the lip rounding gesture for /u/ is characterized by consistently 
higher peaks of muscle activity during the faster speaking rate condition. Thus, 
it would appear that for the phoneme sequences studied here, the observed vowel- 
consonant differences are, in fact, tongue-lip differences. 



TABLE 5: Averaged peak EMG values (µV) of the orbicularis oris muscle for lip
rounding of /u/.

          Subject FSC             Subject TG
          Slow        Fast        Slow        Fast

uti       120         150         80          145
uta       115         165         80          150
utu       145         185         75          140
The X-ray tracings can also serve to illustrate the precision with which
vowel targets are attained, irrespective of preceding or following vowels.
Figure 5 shows vowel target positions of the tongue for the /a/ series of
utterances. Although the routes taken by the tongue toward these targets vary
considerably, the final target position is consistently stable. This finding, which is
in agreement with those of Houde (1967) and MacNeilage and DeClerk (1969),
characterizes the V1 and V2 targets for /p/ as well as for /w/.

Jaw Movement 

The EMG data for the anterior belly of the digastric muscle are summarized
in Table 6. The anterior belly of the digastric acts to open the jaw; thus, the
values shown in this table are associated with jaw opening for V1 and V2.

The muscle activity levels of the anterior belly of the digastric show an
increase during fast speech. This increase is consistent (small in magnitude for
Subject FSC, large in magnitude for Subject TG), but occurs only for the first
vowel. For both subjects, this muscle does not show much more than resting
potentials for V2. It is possible, of course, that these peaks reflect, at least
in part, a stabilizing gesture for /k/. However, the timing of the peaks is
compatible with the vowel, and further, the data for TG show a consistent vowel
effect, i.e., large peaks and greater speaking rate differences for /a/-/u/-/i/,
increasing in that order.

Figure 6 illustrates the effect of speaking rate on jaw movement. For the 
vowel, jaw displacement is usually greater during slow speech than during fast 
speech, but for the consonant, jaw displacement is usually less during fast 
speech. In other words, jaw movement during fast speech mirrors jaw movement 
during slow speech, but with lower absolute displacement values. These data are 
not consistent with recent findings of Abbs (1973), whose measurements for /pip/ 
and /paep/ show little effect of speaking rate on jaw displacement. 

Figure 6 also shows that for the opening segment of the consonant, jaw open- 
ing consistently leads lip opening (lip opening is shown by the vertical lines on 
each graph). This lead effect, which is evident for both subjects, will be de- 
scribed more fully in the following section. Finally, this figure illustrates 
that the effect of speaking rate on jaw velocity varies with context. Although a 
trend seems to exist for rate of movement to increase during faster speech for 
/a/, and sometimes for /i/, no such trend is apparent for /u/. The effects for 
/a/ occur most consistently while effects for /i/ are more variable. The absence 
of any effect for /u/ is obviously because jaw movement for /u/ is minimal at
both speaking rates. 

Jaw movement is also subject to certain anticipatory coarticulation effects. 
Figure 7 shows the jaw displacement curves for all /p/ utterances at the slow 
speaking rate. These curves are arranged so that the data for the first vowel 



The internal pterygoid muscle was also studied for both subjects. However, for 
Subject TG this muscle did not show any activity during speech (although it was 
active for clenching), and for Subject FSC activity was present only during 
closure for initial /k/. 
TABLE 6: Averaged and single token (in parentheses) peak EMG values (μV) for the 
anterior belly of the digastric muscle. Values for the slow speaking 
rate condition are in the left column and values for the fast speaking 
rate condition are in the right column of each cell. An asterisk (*) 
indicates higher values for the slow speaking rate condition. 

                 Subject FSC    Subject TG
                                V1             V2
                 S-F            S-F            S-F

ipip             110-135        120-280        15-15
                 (90-135)       (110-130)      (30-40)
ipap             115-120        120-240        30-75
                 (130-135)      (120-260)      (25-30)
ipup             110-110        115-265        25-35
                 (125-180)      (100-265)      (30-35)
apip             105-170        165-400        30-45
                 (145-220)      (170-420)      (25-25)
apap             135-155        140-375        25-35
                 (120-165)      (130-370)      (20-30)
apup             120-130        200-385        30-30
                 (95-180)       (185-375)      (30-35)
upip                            130-290        25-95
                 (115-125)      (125-265)      (30-40)
upap                            120-305        30-85
                 (105-25)*      (115-275)      (30-75)
upup                            125-285        30-50
                 (125-140)      (125-295)      (25-30)
iwip             125-145        135-255        30-60
                 (110-140)      (110-230)      (30-45)
iwap             120-125        115-250        35-70
                 (125-200)      (140-230)      (30-60)
iwup             110-125        125-260        30-35
                 (115-160)      (145-255)      (45-50)
awip             125-125        190-390        30-55
                 (135-140)      (170-370)      (25-60)
awap                            160-385        25-45
                 (125-130)      (135-350)      (25-40)
awup             115-110*       175-390        25-35
                 (120-150)      (170-410)      (35-40)
uwip             105-130        135-375        25-95
                 (115-115)      (125-410)      (25-90)
uwap             110-115        135-300        25-70
                 (95-115)       (140-300)      (25-90)
uwup             105-135        140-330        25-50
                 (100-110)      (165-335)      (35-40)
Figure 6: Jaw displacement measurements for slow (filled circles) and fast 
(unfilled circles) speaking rates. '0' on the abscissa refers to 
time of lip closure for /p/ and the short vertical lines indicate 
time of lip opening for /p/. 

[Graph: panels for Subjects FSC and TG; legends ipi, ipa, ipu; api, apa, apu; upi, upa, upu; abscissa: DURATION (msec), -250 to 250.] 

Figure 7: Jaw displacement measurements for all /p/ utterances at the slow 
speaking rate. '0' on the abscissa indicates time of lip closure 
for /p/. 
in the VCV sequence can be compared to differences in the second vowel. Except 
for the /u/ series for Subject TG, jaw displacement for V1 is least when V2 is 
/u/. Also, in most cases, the first vowel has relatively little effect on jaw 
position at the time of /p/ closure for Subject FSC, and almost no effect for 
Subject TG. Displacement differences for the vowel do not occur consistently 
between /i/ and /a/, although there is a trend for greater jaw displacement for 
V1 when V2 is /a/ rather than /i/. 

Figure 8 shows these data replotted so that second-vowel comparisons can be 
made, i.e., the effect of a different first vowel on the displacement for the same 
second vowel. For these comparisons, individual differences are evident. For 
Subject FSC, large /u/ effects on jaw displacement for the second vowel occur con- 
sistently, i.e., jaw displacement for the second vowel is least when the first 
vowel is /u/. These curves also show an effect of the first vowel on jaw closure 
for /p/. Here systematic differences in degree of jaw closing for the consonant 
occur, with greater closure for /u/-/i/-/a/, increasing in that order. These 
effects are not so apparent for the second vowel (the comparisons in Figure 6). 
For Subject TG, on the other hand, there are essentially no effects of the first 
vowel on jaw displacement for the second vowel. Also, there do not seem to be 
any anticipatory effects for the second vowel at the time of /p/ closure. The 
jaw movement data for /w/ are essentially the same as for /p/, i.e., they show 
reduced displacement for the vowel and increased closing for the consonant during 
faster speech, and decreased displacement for a vowel when either preceded or 
followed by /u/. 

In summary then, it would seem that jaw movement during fast speech is 
characterized by a pattern similar to that for slow speech, but with a decrease 
in overall displacement. Changes in the velocity of jaw movement, when they 
occurred, did so primarily for /a/. The movement data were only partially sup- 
ported by the EMG data (for Subject TG, where greater activity levels for the an- 
terior belly of the digastric muscle correlated with the increase in velocity of 
the jaw for /a/). Of course, too, this experiment sampled only two of the muscles 
involved in movements of the jaw; this in itself, plus the absence of any activ- 
ity for the second vowel in the test utterances, clearly indicates that a more com- 
plete muscle inventory is needed for an adequate description. 

Coordination of Lip, Tongue, and Jaw Movements 

Table 7 summarizes timing information for movements of the tongue, jaw, and 
lips during the closing, closed, and opening segments of the consonant. The data 
in this table are the relative onset times (lip closure for /p/ or minimum lip 
opening for /w/ = 0) of tongue movement from the first vowel, and lip and jaw 
opening and closing for the consonant, at both speaking rates. 

For both subjects and both speaking rates, the onset of jaw closing for /p/ 
lags behind the onset of both tongue movement and lip closing. The onset time of 
jaw closing from /a/ is earlier than the onset times of either /i/ or /u/. This 
is probably because jaw displacement for /a/ is greater than that for the other 
two vowels. The lips begin to close approximately 75 msec ahead of tongue move- 
ment when the vowel preceding the consonant is /u/. This is apparently part of 
the lip-rounding gesture for /u/. 
Figure 8: Jaw displacement curves for /p/, replotted for second vowel 
comparisons. 

Jaw closing is usually completed at or slightly after lip closure for /p/, 
while jaw opening for the following vowel precedes lip opening by approximately 
50 msec. Except for shorter lead and lag times, similar patterns emerge for fast 
speech. 

Segment durations are somewhat longer for /w/ than /p/ (50 msec for Subject 
FSC, 25 msec for Subject TG). Also, the onset times of lip movement for /w/ 
(with two exceptions) are earlier than those for tongue movement. Jaw closing is 
completed at or slightly ahead of minimum lip opening, while jaw opening begins 
at about the same time as lip opening. Closure duration is shorter for /w/ than 
/p/. 

The absence of an anticipatory movement of jaw opening during closure for 
/w/ contradicts recent data of Gay and Hirose (1973), who showed that for /w/ jaw 
movement was independent of lip movement, anticipating a following vowel by open- 
ing for it during lip closing. This effect was not evident, of course, in the 
present X-ray data, the examination of which easily explains the discrepancy. In 
their experiment, Gay and Hirose measured jaw displacement by the movement of a 
marker painted on the chin. The present X-ray films, however, show that flesh 
points directly over the mandible can move independently from, and even in oppo- 
site directions to, the mandible itself. Thus, indirect measurements such as the 
above can produce seemingly accurate, but in fact erroneous data. It is also 
possible that similar errors are inherent in strain gage measurements, especially 
if the transducer is positioned at the level of the mandibular protuberance. 

The major results of this experiment can be summarized as follows. For lip 
movements associated with either labial consonant production or rounding for the 
vowel, an increase in speaking rate is accompanied by an increase in the activity 
level of the muscle and by slightly faster rates of movement. For tongue move- 
ment during vowel production, an increase in speaking rate has the opposite 
effect: a decrease in the activity level of the muscle and a decrease in articu- 
latory displacement. For jaw movement, the major effect of an increase in speak- 
ing rate is a decrease in the displacement of the jaw throughout the utterance, 
i.e., for both the vowel and consonant. Jaw movement is also more sensitive than 
either lip or tongue movements to changes in phonetic context and shows a lag 
effect for consonant closing and a lead effect for vowel opening. 

DISCUSSION 

The results of this experiment show that the control of speaking rate cannot 
be accounted for by one simple mechanism. As was shown in a previous experiment 
(Gay and Hirose, 1973), and confirmed here, the major effect of an increase in 
speaking rate on the production of a labial consonant is an increase in the activ- 
ity level of the muscles and an increase in the rate of movement of the lips. 
Both of these effects are apparent consequences of an increase in articulatory 
effort. As was also mentioned before, a strategy of this type could be expected 
for a consonant gesture that involves an occlusal target. The articulators must 
approximate for a stop consonant. Thus, it is reasonable to assume that under 
the constraints of an increase in speaking rate, they would do so somewhat faster. 



For /w/, the articulators apparently must reach an invariant target position in 
order to produce an acoustic steady state. 
The data for the tongue (and to some extent the jaw), however, cannot be 
explained in the same way. For tongue movement, an increase in speaking rate is 
accompanied by a decrease in displacement (undershoot) and a decrease in the 
activity level of the muscle. The decrease in the activity level of the genio- 
glossus muscle shows that undershoot is not, as Lindblom (1963) originally sug- 
gested, a consequence of an overlap in the timing of commands to the muscle. 
Although the decrease in muscle activity for the tongue during faster speech 
might reflect only the decrease in overall displacement of the tongue (and not 
any changes in its speed of movement), the decrease itself is indicative of some 
reorganization taking place at the level of the muscle commands. 

The EMG data for the tongue indicate that vowels are characterized by dif- 
ferent targets for slow and fast speech. In other words, a vowel target cannot 
be operationally defined by a set of invariant spatial coordinates. Rather, a 
vowel target must be defined either by a multiple coordinate system or by an 
articulatory field (with limits). For slow speech, and perhaps for stressed 
vowels, one given set of coordinates is aimed for, while for fast speech, where 
articulatory expediency or the constraints of decreased jaw displacement place 
additional demands on the mechanism, a different set of coordinates is aimed for. 

As mentioned above, Kuehn (1973) has shown that different speakers use dif- 
ferent strategies to increase speaking rate, i.e., by trade-offs between dis- 
placement and velocity. These differences might also be explained in terms of a 
field or multiple coordinate system. Some individuals might be able to produce a 
given vowel with a greater degree of freedom than others. That is, the acousti- 
cal properties of a given vocal tract might be such that a wider range of for- 
mants can produce the same perceptual result. Other tracts might not have these 
characteristics and, thus, the articulators must (by increasing velocity) attain 
a more strictly defined set of target coordinates. However, irrespective of both 
the strategy employed and the effect observed, the crucial point is that the 
speaking rate control mechanism involves changes in both the timing and organiza- 
tion of commands to the muscles. 

One final point which requires reiteration is the contrast between the vari- 
ability of target position across changes in speaking rate versus the precision 
in attaining that same target in varied phonetic contexts. The larger speaking 
rate effects can be interpreted to mean that the greatest challenge to the system 
lies not in the organization of a target-directed movement, but rather, in the 
rapid sequencing of such movements. 

In summary, the data of this experiment show that speaking rate is controlled 
by a mechanism that involves more than a simple reordering of the timing of motor 
commands. Reorganization of the gesture for fast speech involves changes in both 
the duration and size of the muscle contraction. However, an adequate descrip- 
tion of the mechanism requires additional information about how the tongue han- 
dles vowel sequences that are separated by both lingual and velar consonants, 
along with a more detailed description of the variability of target position as a 
function of a greater number of speaking rate variations. 



Lindblom's original model implies that only the timing, not the size, of the 
EMG signal would change during faster speech. 
Stress and Syllable Duration Change* 

Katherine S. Harris^ 

Haskins Laboratories, New Haven, Conn. 



It is generally assumed that underlying each phoneme there is an invariant 
articulatory target. At a surface level, this statement is, of course, untrue. 
There is no moment when the articulators assume a position for a given speech 
sound — a position that is invariant over changes in the phonemic context and supra- 
segmental structure. Much of the effort of traditional articulatory phonetics 
was directed towards writing rules to describe observed differences in articula- 
tory target as a change in allophone selection. Modern physiological research 
searches for simple rewrite rules to derive observed positional variants from 
some presumed underlying single articulatory target. 

A carefully worked out theory of this sort is Lindblom's (1963) theory of 
vowel reduction, which was developed to account for the changes in vowel color 
that accompany changes in stress. If a vowel is destressed, it will tend to be 
of shorter duration, and to move in vowel color towards the neutral schwa; the 
latter phenomenon is called vowel neutralization. Lindblom's proposal is that 
the neutralization is a consequence of the accompanying shortening. Briefly, in 
a consonant-vowel-consonant (CVC) sequence, although the signals sent to the 
articulators are constant, the response of the articulators is sluggish. If 
signals arrive at the muscles too fast, the articulators will start towards the 
vowel target but will be deflected towards the subsequent consonant target — that 
is, there will be undershoot. Lindblom tested his theory by having subjects 
produce sentences containing CVC monosyllables. The effect of rearranging the 
sentences was to change the stress on one "word" and consequently to change the 
vowel duration. He made careful measurements of the most extreme positions of 
the first and second formants, as a function of the vowel length. He found that 
as vowels lengthened, the formants tended towards asymptotic values which could 
be described as targets. Equations could be written describing the relation of 
vowel duration to the departure of formant position from target. 
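
The kind of duration-to-undershoot relation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the formant value reached at the vowel's most extreme point approaches its target exponentially as vowel duration grows. The function name, the rate constant, and the formant values are hypothetical and are not taken from Lindblom (1963).

```python
import math

def formant_reached(target_hz, onset_hz, duration_ms, rate_per_ms=0.02):
    """Formant value reached in a vowel of the given duration.

    Assumed form (illustrative only): the distance from the target
    shrinks exponentially with vowel duration.
    """
    remaining = (onset_hz - target_hz) * math.exp(-rate_per_ms * duration_ms)
    return target_hz + remaining

# Hypothetical values: a second-formant target of 2200 Hz approached
# from a consonant-determined starting value of 1500 Hz.
for duration_ms in (50, 100, 200, 400):
    f2 = formant_reached(2200.0, 1500.0, duration_ms)
    print(f"{duration_ms:4d} msec: F2 = {f2:7.1f} Hz, undershoot = {2200.0 - f2:6.1f} Hz")
```

Under these assumptions the undershoot shrinks steadily as the vowel lengthens, so the longest vowels sit close to the asymptotic (target) value.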

Lindblom's theory seems to be elegant and testable, if one substitutes for 
"signals" the more specific "muscle contractions." A reformulation in electro- 
myographic (EMG) terms would then perhaps be: "Under conditions of changing 
stress the EMG signals associated with a CVC sequence will remain constant in 
amplitude. Only the relative timing of vowel and consonant signals will change." 



*Presented to the American Association of Phonetic Sciences, Detroit, Mich., 
11 October 1973. 

^Also the Graduate Center, City University of New York. 

[HASKINS LABORATORIES: Status Report on Speech Research SR-35/36 (1973)] 
Lindblom goes further than this. He assumes that changes in duration oper- 
ate in the same manner whether they are due to stress or to speaking rate. Gay 
(1973) has been investigating some aspects of this formulation with respect to 
speaking rate. I will talk about stress. 

As our model for changing stress, we constructed some two-syllable nonsense 
words, of the form /pVCVp/. The two middle vowels were always /i/, /a/, or /u/, 
and were always different. The first and last consonants were always /p/, and the 
middle consonant was /p/ or /k/. There was a neutral carrier on each end. Using 
this format, we examined EMG signals from several muscles involved in the articu- 
lation. 

The genioglossus muscle will be discussed first. This is a large muscle 
making up a great part of the body of the tongue, acting to bunch it. Conse- 
quently, activity is always seen for the vowel /i/, usually is seen for /u/, and 
none is seen for /a/. No activity is seen for the consonants. 

The effect of the change in stress is shown in Figure 1. (Time runs along 
the abscissa; each unit is 100 msec. Averaged energy is on the ordinate. The ver- 
tical line shows the point when the voicing for the first syllable ends.) The 
two curves in each graph show the two conditions of stress. In each graph the 
thin line shows the utterance with the first syllable stressed and the thick 
line shows the utterance with the second syllable stressed. The rows show 
two different electrode insertions, both into the genioglossus. The curves 
show the two effects of stress usually found. First, there is a small difference 
in peak height between stressed and unstressed vowels (Harris, 1971). 

The second effect is the "lineup" effect. Notice in the left column, that 
when the /i/ is stressed, the genioglossus curve begins earlier, but ends at the 
same time with respect to the offset of voicing. The effect is almost symmetri- 
cal when the /i/ is in the second syllable. In other words, the vowel activity 
lengthens, but dies off in a constant relationship to the offset of voicing. 

Figure 2 illustrates the activity of the orbicularis oris muscle, which 
shows a burst of activity for the first, middle, and last /p/ in these utterances 
since it acts to close the lips. Note that the middle /p/ peaks over the offset 
of voicing, indicated by the vertical line. Again, the first /p/ moves left- 
wards, as the vowel lengthens, when the first syllable is stressed. The last /p/ 
moves rightwards, as the vowel lengthens, when the last syllable is stressed. 

What does this mean? First, the activity for the vowel lengthens. Second, 
the time between consonant peaks changes systematically. Combining these facts, 
we get a picture of stress change illustrated by Figure 3. 

This figure shows the relationship between orbicularis oris and genioglossus 
activity, for four disyllables. In all cases, the vowel activity begins as the 
initial consonant activity wanes. If the vowel is stressed, its activity contin- 
ues for a longer period than if it is not stressed. The middle or terminal con- 
sonant begins as the vowel activity wanes. The vowel seems to lengthen literal- 
ly — that is, associated muscle activity lasts longer. The temporal relationship 
of consonant and vowel activity seems to be fixed. 

Lindblom's model, then, is wrong on two counts: it posits, first, that under 
conditions of changing stress the signals to the muscles will remain constant, 
and second, that the temporal relationship between consonant and vowel signals 
will change. What is found is a difference in the peak amplitude of the signals 
to the relevant muscles while the relationship between consonant and vowel sig- 
nals remains constant. 

A closely parallel observation has been made by Kent and Netsell (1971) in 
a cinefluorographic study of stress contrast effects on various aspects of upper 
articulator movement in noun and verb forms of various words. They note that in 
words like "escort," articulator adjustments for the second vowel occur at the 
same time relative to the intervocalic consonant adjustments regardless of 
lexical stress. Our observations show that the EMG signals underlying the artic- 
ulator movement are organized in a way that parallels the output articulator move- 
ment. 

A second proposal for the stress contrast mechanism is Ohman's (1967) sug- 
gestion that stress is manifested by "extra energy" of articulation of the 
stressed member of a contrasting pair. The results described above are consonant 
with such a description. However, Ohman's proposal is obviously incomplete. If 
all muscles reacted to extra stress with more vigorous activity, the effects on 
antagonistic muscles would cancel each other. There must be uneven effects of 
stress on various muscles. 

The left side of Figure 4 shows some examples from another subject. As 
before, genioglossus activity clearly lengthens with vowel lengthening as stress 
changes. There is also somewhat more activity with stress. However, the differ- 
ence is not huge. The right side shows sample records taken simultaneously from 
the geniohyoid muscle. The activity appears to be correlated with jaw opening, 
at least in part. The important point is that stress effects here are very much 
larger. Apparently, then, the effects of stress are not evenly distributed to 
all muscles. Perhaps the effects of stress change are greater on jaw muscles 
than on tongue muscles; I should say, however, that we have a poor understanding 
of the opening and closing movements of the jaw. 

What about Lindblom's (1963) hypothesis about the homogeneity of all dura- 
tion change mechanisms? He discusses only stress and speaking rate contrasts 
specifically. A third type of vowel duration change is the well-known effect of 
voicing status in the terminal consonant. We have scattered information on all 
three problems. Gay, Ushijima, Hirose, and Cooper (1973) have shown that the 
effects of speaking rate change are not uniform on all articulatory components. 
Some gestures show more forceful articulation when rate increases, while others 
show less activity. 

Raphael (1970) has collected some data on terminal consonant effects on 
vowel duration. The data are similar to what we have just seen — that is, the 
vowel gesture lengthens before the voiced consonant, but the timing relationship 
between consonant and vowel is fixed. 

Let me conclude by summarizing. First, interrelationship of the consonant 
and vowel is surprisingly constant over stress lengthening, a finding not pre- 
dicted by a previous model of the process. Second, stress effects can be consid- 
ered as "more energetic" enunciation, but the effects of increased energy are 
distributed unevenly, according to a pattern we do not now understand, over the 
relevant articulatory muscles. Third, the mechanism of duration change is not 
uniform for all suprasegmental manipulations. 
Figure 4 

Parallel Processing of Auditory and Phonetic Information in Speech Perception* 

Charles C. Wood^ 



Recent experiments using a variety of techniques have suggested 
that speech perception involves separate auditory and phonetic levels 
of processing. Two models of auditory and phonetic processing appear 
to be consistent with existing data: a) a strict serial model in 
which auditory information would be processed at one level followed 
by the processing of phonetic information at a subsequent level; and 
b) a parallel model in which at least some portion of auditory and 
phonetic processing could proceed simultaneously. The present ex- 
periment attempted to distinguish empirically between these two 
models. Subjects identified either an auditory dimension (fundamen- 
tal frequency) or a phonetic dimension (place of articulation of the 
consonant) of synthetic consonant-vowel syllables. When the two di- 
mensions varied in a completely correlated manner, reaction times 
were significantly shorter than when either dimension varied alone. 
This "redundancy gain" could not be attributed to speed-accuracy 
trades, selective serial processing, or differential transfer between 
conditions. These results are consistent only with a model in which 
auditory and phonetic information can be processed in parallel. 

Auditory and Phonetic Levels of Processing in Speech Perception 

Current theories of speech perception generally view the process by which 
linguistic information is extracted from an acoustic speech signal as a hier- 
archy of logically distinct levels or stages (see, for example, Fry, 1956; Fant, 
1967; Stevens and Halle, 1967; Stevens and House, 1972; Studdert-Kennedy, in 
press). Recently we began a series of experiments designed to investigate pos- 
sible levels of processing involved in the perception of isolated consonant- 



*Expanded version of a paper presented at the 86th meeting of the Acoustical 
Society of America, Los Angeles, Calif., 30 October 1973. 

^Neuropsychology Laboratory, Veterans Administration Hospital, West Haven, Conn., 
and Yale University, New Haven, Conn. Present address: Division of Neuropsy- 
chiatry, Walter Reed Army Institute of Research, Washington, D. C. 20012. 
vowel (CV) syllables (Day and Wood, 1972a, 1972b; Wood, 1973). The results of 
these experiments, together with those using other procedures (cf. Studdert- 
Kennedy and Shankweiler, 1970; Studdert-Kennedy, Shankweiler, and Pisoni, 1972; 
Tash and Pisoni, 1973), provide empirical support for a basic distinction between 
two levels of processing in phonetic perception: 1) an auditory level, in which 
an acoustic speech signal is analyzed into a set of corresponding auditory param- 
eters; and 2) a phonetic level, in which abstract phonetic features are extracted 
from the results of the preliminary auditory analysis. 

The basic paradigm used in these experiments was a two-choice speeded clas- 
sification task similar to that employed by Garner and Felfoldy (1970) to study 
patterns of interaction between stimulus dimensions. Subjects were presented a 
sequence of synthetic CV syllables which varied between two levels of a given 
dimension and were required to identify which level of that dimension occurred 
on each trial.1 Reaction time (RT) for the identification of each dimension was 
measured under two conditions: a) a single-dimension control condition in which 
only the target dimension to be identified varied in the stimulus sequence, and 
b) a two-dimension orthogonal condition in which both the target dimension and 
the irrelevant nontarget dimension varied orthogonally. For a given target 
dimension the only difference between the control and orthogonal conditions was 
the presence or absence of irrelevant variation in the nontarget dimension. 
Therefore, a comparison of the RTs from these two conditions indicates the degree 
to which each dimension may be processed independently of irrelevant variation in 
the other dimension. 
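
The comparison described above amounts to a difference score for each target dimension. The sketch below shows that arithmetic only; the helper name, the dimension labels, and the mean RTs are hypothetical and are not data from the experiments cited, and a real analysis would also need a significance test on the difference.

```python
def interference_ms(control_rt_ms, orthogonal_rt_ms):
    """Increase in mean RT attributable to irrelevant variation
    in the nontarget dimension (orthogonal minus control)."""
    return orthogonal_rt_ms - control_rt_ms

# Hypothetical mean RTs in msec for two target dimensions.
mean_rt = {
    "dimension A": {"control": 400.0, "orthogonal": 450.0},
    "dimension B": {"control": 400.0, "orthogonal": 405.0},
}

for dimension, rt in mean_rt.items():
    cost = interference_ms(rt["control"], rt["orthogonal"])
    print(f"{dimension}: interference = {cost:.1f} msec")
```

A score near zero for a dimension would suggest that it can be processed independently of irrelevant variation in the other dimension; a large positive score would suggest that it cannot.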

The initial experiments in this series analyzed the interactions between the 
following pairs of dimensions: a) an auditory and a phonetic dimension, b) two 
auditory dimensions, and c) two phonetic dimensions. Day and Wood (1972a) and 
Wood (1973, Experiment 1) compared the auditory dimension — fundamental frequency — 
with the phonetic dimension — place of articulation of voiced stop consonants. 
For convenience, these dimensions will be referred to as Pitch and Place, respec- 
tively. In both experiments irrelevant variation in Pitch significantly inter- 
fered with the processing of Place, but irrelevant variation in Place produced 
minimal interference with the processing of Pitch. Thus, the processing of the 
auditory dimension appeared to be independent of the phonetic dimension but not 
the reverse. 

A different pattern of results was obtained in experiments where the two 
dimensions were from the same class — both auditory or both phonetic. Wood (1973, 
Experiment 2) compared the same levels on the Pitch dimension used in the initial 
experiments with another auditory dimension, overall Intensity. Neither of these 
auditory dimensions could be processed independently of the other; that is, 
irrelevant variation in either dimension produced substantial interference with 
the identification of the other. Finally, for the case of two phonetic dimen- 
sions Day and Wood (1972b) compared Place with formant positions of the vowels 
in the CV syllables. The pattern of interaction between these phonetic dimen- 
sions was identical to that obtained for two auditory dimensions: neither phonet- 
ic dimension could be processed independently of irrelevant variation in the other. 






The term "dimension" is used in this paper to refer to aspects or properties of 
stimuli which are varied in a given experiment. It should be emphasized that 
this term explicitly does not imply that the stimulus property in question is 
singular or unitary in a perceptual sense. The latter is an empirical question 
(cf. Garner, 1970, 1973, in press; Garner and Felfoldy, 1970). 
These results may be summarized in the following way. When both dimensions 
were members of the same class — both auditory or both phonetic — the interaction 
between them in this paradigm was a mutual or symmetric interaction. This 
result is typical of that obtained for "integral" stimulus dimensions in the 
experiments of Garner and Felfoldy (1970), and is what would be expected if the 
two dimensions were extracted by a single perceptual process or by multiple pro- 
cesses that are strongly dependent upon each other. In contrast, the interac- 
tion between a phonetic dimension and an auditory dimension was a unidirectional 
or asymmetric interaction. This kind of interaction is evidence a) that the 
two dimensions are not extracted by a single perceptual process, and b) that the 
component processes for the phonetic dimension are in some way dependent upon 
those for the auditory dimension. 

Serial Versus Parallel Organization of Auditory and Phonetic Levels 

Two process models appear to be consistent with the unidirectional interac- 
tion between Place and Pitch: a) a strict serial or sequential model in which 
auditory information would be processed at one level followed by the processing 
of phonetic information at a subsequent level, and b) a parallel model in which 
at least some portion of auditory and phonetic processing could proceed simul- 
taneously.2 Various forms of both serial and parallel processing have been in- 
corporated in most theories of speech perception (cf. Stevens and House, 1972; 
Studdert-Kennedy, in press). 

A serial organization of auditory and phonetic levels could account for the 
unidirectional interaction between Place and Pitch by the position of each com- 
ponent process in the sequence. According to this model, a response could be 
made based on the output of the auditory level alone without processing by the 
subsequent phonetic level, while access to the phonetic level could occur only 
after prior auditory processing. In contrast, a parallel model could not pro- 
vide such a direct explanation for the unidirectional interference between Place 
and Pitch. However, the parallel model would be more consistent than a serial 
model with the finding of the initial experiments that RTs for Place and Pitch 
were not significantly different in the control conditions. 3 



2 Nickerson (1971) pointed out an important limitation of the terms "simultaneous" 
and "parallel" as used in current information-processing experiments: "What may 
appear to be simultaneous activities at one level of analysis may prove to be 
the result of an efficient switching process when the analysis is carried to a 
more microtemporal level" (p. 276). Following Nickerson's analysis, the terms 
simultaneous and parallel will be used in the present paper to indicate: "... 
that the processes in question are proceeding concurrently relative to the time 
scale on which they are measured, which is to admit the possibility of inter- 
mittent switching of attention between one process and another on a more micro- 
temporal scale" (p. 277). 

3 

According to a strict serial model any task which requires a response based on 
information about Place would also require prior processing of information about 
Pitch, therefore resulting in longer processing times for Place than for Pitch 
(for data which seem to satisfy such a model, see Posner and Mitchell, 1967; 
Posner, 1969). As described above, RTs in the orthogonal condition of the 
initial experiments were indeed longer for Place than for Pitch. However, 
(continued next page) 
The purpose of the present experiment was to investigate more directly the 
degree to which the processing of auditory and phonetic dimensions occurs 
serially or in parallel. An efficient way of distinguishing between serial and 
parallel processing in the context of the RT experiments described above is to 
include, in addition to the control and orthogonal conditions, a third condition 
in which the two dimensions are completely correlated. In such a correlated 
condition both the target dimension and the redundant nontarget dimension pro- 
vide sufficient information for a correct response. The outcome of interest is 
simply whether the additional information provided by the redundant dimension 
can be used by the subject to facilitate performance (i.e., to decrease RT rela- 
tive to the control conditions without sacrificing accuracy). 

A strict serial model would predict no decrease in RT in the correlated 
condition, since according to this model the Pitch dimension would always have 
to be processed first, followed by the processing of Place at a subsequent level. 
In contrast, given certain reasonable assumptions, the parallel model would pre- 
dict a decrease in RT (usually referred to as a "redundancy gain") in the corre- 
lated conditions relative to the control conditions for both dimensions. Accord- 
ing to this model, the processing of both dimensions would be initiated simul- 
taneously and the response would be based on the component process which was com- 
pleted first. The decrease in RTs in the correlated conditions would therefore 
occur statistically from the fact that the distribution that results from select- 
ing the faster of two "competing" distributions is faster than either "competing" 
distribution alone. 5 
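
The statistical argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two component processes are treated as independent, normally distributed finishing times with the same hypothetical mean (380 msec) and standard deviation (60 msec). None of these values comes from the experiment, and the function name is illustrative.

```python
import random
from statistics import mean

def simulate_race(n_trials=20000, mu=380.0, sigma=60.0, seed=0):
    """Mean finishing time of each process alone and of the faster of the two."""
    rng = random.Random(seed)
    # Hypothetical finishing times (msec) for the two component processes.
    place = [rng.gauss(mu, sigma) for _ in range(n_trials)]
    pitch = [rng.gauss(mu, sigma) for _ in range(n_trials)]
    # Parallel "race": the response is based on whichever process finishes first.
    fastest = [min(a, b) for a, b in zip(place, pitch)]
    return mean(place), mean(pitch), mean(fastest)

place_mean, pitch_mean, race_mean = simulate_race()
print(round(place_mean, 1), round(pitch_mean, 1), round(race_mean, 1))
```

For two independent, identically distributed normal variables the expected minimum is mu - sigma / sqrt(pi), so under these assumptions the simulated race should come out roughly 34 msec faster than either process alone. A strict serial model has no such mechanism for a gain.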



(continued) 

according to a serial model this result should be true for the control condition 
as well, since in both conditions identification of Pitch would require only one 
level of processing, while identification of Place would involve processing at 
both levels. The control condition RTs were in fact slightly longer for Place 
than for Pitch in both of the initial experiments (mean differences of 10.0 and 
3.4 msec), although in neither case was the difference statistically signifi- 
cant. Thus, the existing data are equivocal about the serial-parallel distinc- 
tion. These results suggest either a) that processing of auditory and pho- 
netic information can occur in parallel, or b) that processing is serial but 
the extra processing time required for the phonetic level is extremely small 
relative to the variability of the RTs in the initial experiments. 

4 Garner (1970; see also Garner and Felfoldy, 1970) has argued that the distinc- 
tion between "integral" and "separable" stimulus dimensions must logically pre- 
cede any consideration of serial versus parallel processing, since the serial- 
parallel question is inappropriate for pairs of integral dimensions. The uni- 
directional interference between Place and Pitch in the initial experiments is 
clear evidence that these dimensions are not completely integral, and therefore 
that the serial-parallel question is appropriate. 

5 For a detailed analysis of the assumptions and predictions of various serial 
and parallel models in this and other paradigms, see Egeth (1966); Smith (1968); 
Hawkins (1969); Biederman and Checkosky (1970); Egeth, Jonides, and Wall (1972); 
Grill (1971); Lockhead (1972); Nickerson (1971); Townsend (1971); Saraga and 
(continued next page) 
METHOD 



Subjects 

Each of six subjects (five volunteers and the experimenter) served in an 
experimental session lasting approximately two hours. 

Stimuli 

The acoustic stimuli were the four synthetic CV syllables /bae/-104 Hz, 
/bae/-140 Hz, /gae/-104 Hz, and /gae/-140 Hz, corresponding to the two levels 
on the two dimensions Place and Pitch. These two-formant stimuli were generated 
by the Haskins Laboratories parallel resonance synthesizer and were prepared to 
be equal in all acoustic parameters other than the two dimensions explicitly 
varied for experimental purposes. All four stimuli had identical fundamental 
frequency contours (falling), intensity contours (falling), duration (300 msec), 
and formant frequencies appropriate for the vowel /ae/. Pairs of stimuli dif- 
fering on the Place dimension differed in the direction and extent of the second 
formant transition (Liberman, Delattre, Cooper, and Gerstman, 1954; Delattre, 
Liberman, and Cooper, 1955), while pairs of stimuli differing on the Pitch dimen- 
sion differed in fundamental frequency (initial fundamental frequencies of 104 Hz 
versus 140 Hz). Spectrograms of the four stimuli are shown in Figure 1. 

Identification Task, Dimensions, and Conditions 

Subjects listened to blocks of trials in which the Place dimension, the 
Pitch dimension, or both dimensions could vary within a block. One dimension 
was specified as the target dimension for each block of trials and subjects were 
required to indicate as rapidly as possible which of the two levels on that tar- 
get dimension occurred on each trial. Subjects made their responses by pressing 
one of two response buttons with either the index or middle finger of the pre- 
ferred hand. Each button was assigned to the same level on each dimension 
throughout the experiment. 

Place and Pitch were each specified as the target dimension in three dif- 
ferent conditions, with each condition presented in a separate block of 64 trials. 
For the target dimension these three conditions were identical; that is, in all 
three conditions the target dimension varied randomly between its two levels and 
subjects had to identify which level on that dimension occurred on each trial. 
The only difference between conditions was the status of the nontarget dimension. 
In the control condition, the nontarget dimension was held constant at one of 
its two levels throughout the entire block of trials. For half the subjects the 
nontarget dimension was held constant at one level and for the remaining half it 
was held constant at the other level. In the orthogonal condition, the target 
dimension again varied randomly but in this case the nontarget dimension varied 
orthogonally. Thus all four stimuli occurred randomly in the orthogonal condi- 
tion. The control and orthogonal conditions were therefore identical to those 



(continued) 

Shallice (1973). Biederman and Checkosky (1970) and Lockhead (1972) present 
particularly clear discussions of the statistical rationale for the prediction 
of a parallel model that RTs in the correlation condition should be faster than 
those of the single-dimension control conditions. 
[Figure 1 panels: /bae/-104 Hz and /gae/-104 Hz (upper row); /bae/-140 Hz and /gae/-140 Hz (lower row). Time scale: 300 msec.] 

Figure 1: Spectrograms of the four synthetic syllables. Stimuli differing in 
Place (/bae/ versus /gae/) differed in the direction and extent of 
the F2 transition (left versus right half of the figure), while 
stimuli differing in Pitch (104 Hz versus 140 Hz) differed in funda- 
mental frequency (upper versus lower half). The four stimuli were 
identical in all acoustic parameters other than Place and Pitch. 
employed in the previous RT experiments described above (Day and Wood, 1972a, 
1972b; Wood, 1973). Finally, in the correlated condition both dimensions again 
varied but in this case were completely correlated. The stimuli for this condi- 
tion were /bae/-104 Hz and /gae/-140 Hz, regardless of which dimension was 
specified as the target dimension. 
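
The stimulus sets for the three conditions can be written out explicitly. The sketch below is illustrative only. The function names are hypothetical, and equal numbers of each stimulus per 64-trial block is an assumption, since the text says only that the dimensions varied randomly.

```python
import random

PLACE_LEVELS = ("bae", "gae")
PITCH_LEVELS = (104, 140)  # initial fundamental frequency in Hz

def stimulus_set(condition, target, held_constant=None):
    """Return the (place, pitch) stimuli that can occur in one block."""
    if condition == "orthogonal":
        # Both dimensions vary independently: all four stimuli occur.
        return [(pl, pi) for pl in PLACE_LEVELS for pi in PITCH_LEVELS]
    if condition == "correlated":
        # Both dimensions vary together, whichever one is the target.
        return [("bae", 104), ("gae", 140)]
    if condition == "control":
        # Only the target varies; the nontarget is fixed at one level.
        if target == "place":
            return [(pl, held_constant) for pl in PLACE_LEVELS]
        return [(held_constant, pi) for pi in PITCH_LEVELS]
    raise ValueError(f"unknown condition: {condition}")

def make_block(condition, target, held_constant=None, n_trials=64, seed=0):
    """Shuffled trial list; equal stimulus frequencies are an assumption."""
    stimuli = stimulus_set(condition, target, held_constant)
    trials = stimuli * (n_trials // len(stimuli))
    random.Random(seed).shuffle(trials)
    return trials

block = make_block("control", "place", held_constant=104)
```

Note that the correlated set is the same for both target dimensions, which is why the two correlated blocks are in effect repetitions of one task.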

Apparatus 

The stimuli were presented binaurally through Koss Pro-4AA earphones from a 
Precision Instrument FM tape recorder through computer-controlled relays and a 
Grason-Stadler electronic switch. A 64-item series of each of the four stimuli 
was recorded on a separate channel of the stimulus tape, synchronized to 60 μsec 
accuracy under computer control. On a given trial any of the four possible 
stimuli could be presented to the subject by closing a relay between the appro- 
priate stimulus channel and the subject's earphones. For each block of trials 
a LINC computer read a pseudorandom sequence of predetermined stimulus codes 
from paper tape and closed the appropriate relays in that specified sequence. In 
this way the same four-channel stimulus tape was used for all six conditions. 

The subject's identification responses and RTs were recorded to the nearest 
msec by an external clock (Beckman-Berkley Universal Counter-Timer). The clock 
was triggered simultaneously with stimulus onset and was halted by the subject's 
response. Following the response on each trial the LINC read the RT from the 
counter and punched the stimulus code, response, and the RT for that trial on 
paper tape for later analysis. 

Procedure 

Each subject received one block of 64 trials in each of the six conditions 
(two dimensions x three conditions per dimension) in an order specified by a 
balanced Latin square. Over the group of six subjects each of the six conditions 
occurred once in each sequential position and preceded and followed every other 
condition once. 
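
One standard way to build a square with these properties for an even number of conditions is the Williams design, sketched below. The text does not say which particular square was used, so the construction and the assignment of condition labels to indices are assumptions made for illustration.

```python
def balanced_latin_square(n):
    """Williams design for even n: each row is the base sequence shifted by r (mod n)."""
    if n % 2:
        raise ValueError("this construction needs an even number of conditions")
    # Base sequence 0, 1, n-1, 2, n-2, ... has all-distinct successive differences.
    base, low, high = [0], 1, n - 1
    while len(base) < n:
        base.append(low)
        low += 1
        if len(base) < n:
            base.append(high)
            high -= 1
    return [[(x + r) % n for x in base] for r in range(n)]

# Hypothetical labelling of the six conditions (two dimensions x three conditions).
CONDITIONS = [f"{dim}/{cond}"
              for dim in ("Place", "Pitch")
              for cond in ("control", "orthogonal", "correlated")]

square = balanced_latin_square(6)
orders = [[CONDITIONS[i] for i in row] for row in square]  # one row per subject
```

Each row gives the order of blocks for one subject. Every condition appears once in each serial position, and every ordered pair of distinct conditions occurs in immediate succession exactly once.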

At the beginning of the experimental session, each subject was informed of 
the general nature of the experiment, the stimuli and dimensions to be presented, 
and the identification tasks to be required. Both speed and accuracy were 
strongly emphasized in all conditions. For the control condition subjects were 
instructed that only the target dimension would vary in that block of trials, 
and on each trial they were to identify which level on that dimension had 
occurred. For the orthogonal condition they were told that one dimension would 
be the target dimension but that the other dimension would also vary. They were 
instructed to identify the target dimension and to ignore variations in the 
irrelevant nontarget dimension. Finally, in the correlated condition subjects 
were told that the two dimensions would be completely correlated, with the levels 
on both dimensions specifying the same response. In this condition they were 
again instructed to identify the target dimension as rapidly and accurately as 
possible, and to use the extra information provided by the redundant dimension 
if possible. 

Prior to the block of trials for each condition, subjects received at 
least eight practice trials under conditions identical to those they would re- 
ceive in the following block. These practice trials were designed to stabilize 
RT performance and allow subjects to become familiar with the stimulus set and 
identification task they would receive in that block. 
Data Analysis 



As in the initial experiments, subjects made very few errors, averaging less 
than 3 percent over the entire experiment. Since there were no significant dif- 
ferences in the number of errors among any of the six conditions (Mann-Whitney U 
tests), error scores will not be considered in detail in the analysis below. 

For statistical analysis of the RT data, a complete four-way factorial anal- 
ysis of variance was computed (Subjects x Conditions x Dimensions x Within Cell). 
The data entered into this analysis were the untransformed RTs with the single 
exception that all values greater than 1 sec were set equal to 1 sec. This pro- 
cedure eliminated the few very long RTs due to subjects' failure to press the 
response buttons sufficiently to make electrical contact, etc. Subsequent in- 
dividual comparisons among main effect and interaction means were made according 
to the Scheffe procedure (Winer, 1962). 
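
The 1-sec ceiling on RTs amounts to the following transformation, shown here as a minimal sketch. The function name and the example values are hypothetical and serve only to show the effect on one over-long trial.

```python
RT_CEILING_MSEC = 1000.0  # values greater than 1 sec are set equal to 1 sec

def clip_rts(rts_msec, ceiling=RT_CEILING_MSEC):
    """Replace every RT above the ceiling with the ceiling; leave the rest unchanged."""
    return [min(rt, ceiling) for rt in rts_msec]

example = [352.0, 410.0, 1874.0, 366.0]  # hypothetical RTs in msec
print(clip_rts(example))  # [352.0, 410.0, 1000.0, 366.0]
```

Clipping keeps the trial in the analysis, so cell sizes stay equal, while limiting the influence of the occasional failed button press on the cell mean.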

RESULTS 

Before examining the correlated conditions for evidence of serial or paral- 
lel processing, it is important to establish that the results obtained in the 
control and orthogonal conditions of the present experiment were similar to 
those in the corresponding conditions of the initial experiments. These data 
are presented in Table 1. For Place, there was an increase in RT of 57.3 msec 



TABLE 1: Mean reaction time (in msec) for each dimension and condition. 

                        Dimension 
Condition           Place      Pitch 

Control             386.8      381.4 
Correlated          342.8      346.1 
Orthogonal          444.1      385.2 

(Note: According to the Scheffe method for individual comparisons, a differ- 
ence between any pair of means ≥ 29.7 msec is significant at p < .001.) 



from the control to the orthogonal condition, while the increase between condi- 
tions for Pitch was only 3.8 msec. In the analysis of variance, the effects of 
Conditions, Dimensions, and the Condition x Dimension interaction were signifi- 
cant, F(2,126) = 154.3, p < .001; F(1,63) = 53.73, p < .001; and F(2,126) = 
26.12, p < .001, respectively. According to the results of the Scheffe analysis 
on the Condition x Dimension interaction means in Table 1, a difference ≥ 29.7 
msec was significant at p < .001. Thus, these data are consistent with the uni- 
directional interference between Place and Pitch obtained in the initial experi- 
ments, and therefore indicate that analysis of the correlated conditions for 
evidence of parallel processing is appropriate for these dimensions. 

In the correlated conditions there were substantial decreases in RT for 
both dimensions (Table 1): 43.9 msec for Place and 35.3 msec for Pitch 
(p < .001). These significant redundancy gains are clearly in accord with the 
predictions of a parallel model. 
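
The interference and redundancy-gain figures can be recomputed from the cell means in Table 1 and set against the Scheffe critical difference of 29.7 msec. The sketch below uses only the published means, and the variable and function names are illustrative. Because the table entries are rounded, the Place gain comes out as 44.0 msec rather than the 43.9 msec quoted in the text, presumably a rounding difference.

```python
MEAN_RT = {  # msec, from Table 1
    "Place": {"control": 386.8, "correlated": 342.8, "orthogonal": 444.1},
    "Pitch": {"control": 381.4, "correlated": 346.1, "orthogonal": 385.2},
}
SCHEFFE_CRITICAL_DIFF = 29.7  # msec, p < .001

def contrasts(dimension):
    """Return (orthogonal interference, redundancy gain) in msec for one dimension."""
    rt = MEAN_RT[dimension]
    interference = rt["orthogonal"] - rt["control"]
    gain = rt["control"] - rt["correlated"]
    return round(interference, 1), round(gain, 1)

for dim in MEAN_RT:
    interference, gain = contrasts(dim)
    print(dim,
          "interference", interference, interference >= SCHEFFE_CRITICAL_DIFF,
          "gain", gain, gain >= SCHEFFE_CRITICAL_DIFF)
```

Both gains exceed the critical difference, while only the Place interference does, which is the asymmetric pattern described in the text.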

However, several alternative explanations for the redundancy gain must be 
ruled out before concluding that Place and Pitch were actually processed in 
parallel. First, as mentioned above there were no significant differences in 
errors among any conditions of the experiment, therefore eliminating the possi- 
bility that speed-accuracy trades could be responsible for the observed reduction 
in RT in the correlated conditions. 

A second way in which a redundancy gain could be obtained in the absence of 
parallel processing is the strategy of "selective serial processing" (SSP; cf. 
Garner, 1969; Morton, 1969; Biederman and Checkosky, 1970; Garner and Felfoldy, 
1970; Felfoldy and Garner, 1971). According to the SSP strategy the underlying 
mode of processing is strictly serial but subjects are presumed to have the 
ability to select which dimension they actually process in tasks with redundant 
dimensions. An apparent redundancy gain could therefore be produced if each 
subject performed the correlated task based on the faster of the two single 
dimensions, regardless of which was specified as target dimension by the instruc- 
tions. 

In an explicit analysis of the SSP strategy, Felfoldy and Garner (1971) 
suggested that either of the following two conditions must be met in order for 
SSP to be effective: a) that RTs for one dimension are significantly faster 
than the other across all subjects (as would be the case if the two dimensions 
differed greatly in discriminability); or b) that each subject is able to identi- 
fy one dimension significantly faster than the other, but the faster dimension is 
different for different subjects. In either case the SSP strategy could produce 
decreased RTs in the correlated condition relative to the control condition, but 
without true parallel processing. The first of these two conditions for SSP was 
clearly not met in the present experiment, since mean RT for Place and Pitch in 
the control conditions was not significantly different (Table 1). However, there 
was a significant interaction of Subjects x Dimensions, F(5,315) = 25.24, 
p < .001, indicating that there were reliable differences in RTs between the two 
dimensions for individual subjects. Therefore, it is logically possible that SSP 
could have produced the redundancy gain. 

Whether or not subjects actually used the SSP strategy can be evaluated 
directly by comparing the RTs from the correlated conditions to RTs from the 
faster of the two control conditions for each subject (Garner, 1969; Morton, 
1969; Biederman and Checkosky, 1970; Garner and Felfoldy, 1970; Felfoldy and 
Garner, 1971). Optimal performance for each subject under the SSP model would 
result in the RT of the correlated conditions being equal to the RT of that sub- 
ject's faster control condition. In contrast, the parallel model predicts that 
RTs in the correlated conditions will be faster than the control, even after 
correcting for the possible use of SSP. In the present experiment the mean of 
the faster control RTs for each subject was 369.4 msec. In contrast, mean RTs 
in the correlated conditions were 342.8 msec for Place and 346.1 msec for Pitch 
(Table 1). A separate analysis of variance and subsequent individual comparisons 
among these means (Scheffe method) showed that RTs in both correlated conditions 
were significantly faster than each subject's faster control RTs, F(2,126) = 
14.05, p < .001. Therefore, the obtained redundancy gain cannot be attributed to 
the SSP strategy. 
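
The logic of the SSP correction can be sketched as follows. Per-subject control means are not reported in the text, so the three subjects below are hypothetical values chosen only to show the computation, and the function names are illustrative.

```python
def ssp_baseline(control_rts):
    """Mean over subjects of each subject's faster single-dimension control RT."""
    fastest = [min(rt["Place"], rt["Pitch"]) for rt in control_rts.values()]
    return sum(fastest) / len(fastest)

def gain_beyond_ssp(control_rts, correlated_mean):
    """Positive values mean the correlated RT beats the best that SSP could achieve."""
    return ssp_baseline(control_rts) - correlated_mean

# Hypothetical control means (msec); NOT data from the experiment.
hypothetical_controls = {
    "S1": {"Place": 400.0, "Pitch": 370.0},  # faster on Pitch
    "S2": {"Place": 360.0, "Pitch": 390.0},  # faster on Place
    "S3": {"Place": 380.0, "Pitch": 380.0},  # no difference
}
baseline = ssp_baseline(hypothetical_controls)            # (370 + 360 + 380) / 3
extra_gain = gain_beyond_ssp(hypothetical_controls, 345.0)
print(baseline, extra_gain)
```

With the values reported in the text, the corresponding baseline is 369.4 msec, so the correlated means of 342.8 and 346.1 msec lie 26.6 and 23.3 msec below the best performance available under SSP.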
A third source of an apparent redundancy gain in the absence of parallel pro- 
cessing is the possibility of differential transfer between control and corre- 
lated conditions. As pointed out by Biederman and Checkosky (1970), greater 
positive transfer between the two correlated conditions than between the two con- 
trol conditions might tend to reduce artificially the RTs in the correlated condi- 
tion relative to the control condition. This would be true both for parallel 
processing and for SSP, since in either case the two correlated conditions would 
actually be repetitions of the same task while the two control conditions would 
always be different. To examine this possibility of differential transfer, the 
control conditions received first and second in sequence by each subject were 
compared to the correlated conditions received first and second by each subject, 
without regard to the actual target dimension in each condition. For the control 
conditions mean RT for the second block was 1.3 msec faster than the first, while 
for the correlated conditions the second block was 2.9 msec slower. Thus, there 
was minimal transfer between the two blocks of trials within each condition, and 
the direction of the obtained differences favored the control and not the corre- 
lated conditions. 

Finally, the question of explicit instructions to the subjects regarding the 
correlation between dimensions should be considered. It might be argued that 
such instructions could bias subjects toward responding faster in the correlated 
conditions, leading to a false conclusion about a redundancy gain. In their 
analysis of the conditions in which the SSP strategy is effective, Felfoldy and 
Garner (1971) compared the effects of explicit instructions about the correlation 
between dimensions with implicit instructions consisting of exposure to only the 
control and correlated stimulus sequences. Using stimulus dimensions which do 
not produce a redundancy gain under conditions of neutral instructions (Garner 
and Felfoldy, 1970), Felfoldy and Garner (1971) showed that the explicit instruc- 
tions produced extensive use of SSP by all subjects, but no evidence of a redun- 
dancy gain beyond that attributable to SSP alone. That is, RTs in the correlated 
conditions were significantly faster than in the control conditions, but were not 
significantly different from each subject's faster control condition. 

The explicit instructions were therefore used in the present experiment to 
maximize the ability to discriminate between a redundancy gain due to parallel 
processing and one produced by SSP. Indeed, a result demonstrating selective 
serial processing of auditory and phonetic information would have implications 
for models of speech perception as important as those of parallel processing. 
The correction for SSP described above rules out possible biased reduction of RTs 
in the correlated condition per se since it demonstrates that correlated RTs were 
faster than would be possible according to a serial model. However, this correc- 
tion does not completely eliminate the possibility that the explicit instructions 
produced artificially inflated RTs in the control conditions, thereby producing 
faster RTs in the correlated conditions than in the control. This possibility 
is unlikely since: a) the control RTs for both dimensions in the present experi- 
ment averaged 25 msec faster than those of extremely well practiced subjects in 
the experiment of Wood (1973, Experiment 1) under similar conditions, and b) the 
differences in RT between control and orthogonal conditions in this experiment 
were virtually identical to those of Wood (1973, Experiment 1). Thus, if sub- 
jects inflated the control RTs they would have also had to inflate the orthogonal 
RTs by a precisely equal amount. These observations make the biased inflation of 
the control RTs extremely unlikely. 
DISCUSSION 



The results of this experiment have implications both for models of speech 
perception in a narrow sense, and for the broader question of how multidimension- 
al stimuli of any kind are perceived. The relation of the present results to 
both these problem areas is discussed below. 

Relation to Multidimensional Information Processing 

The way in which humans process multidimensional stimuli has been the sub- 
ject of considerable theoretical and experimental effort (cf. Garner, 1962, 1970, 
1973, in press; Posner, 1964; Egeth, 1966, 1967; Lockhead, 1966, 1970, 1972; 
Egeth and Pachella, 1969; Nickerson, 1971). The large number of experiments de- 
voted to this question can be classified into two distinct approaches or patterns 
of major emphasis: a) those which emphasize stimulus concepts and focus upon the 
nature of the stimulus dimensions which make up the multidimensional stimuli, and 
b) those which emphasize processing concepts and focus upon the nature of the 
processes by which the multidimensional stimuli are perceived. 

Major emphasis upon stimulus concepts has come from the distinction between 
integral and separable stimulus dimensions (Garner and Felfoldy, 1970; Garner, 
1970, 1973, in press). Expanding upon previous suggestions by Torgerson (1958), 
Attneave (1962), Shepard (1964), Lockhead (1966), and Hyman and Well (1968), 
Garner and Felfoldy (1970) argued that the concept of integral dimensions could 
best be defined by converging experimental operations (Garner, Hake, and Eriksen, 
1956): "Integral dimensions are those which lead to a Euclidean metric in direct 
distance scaling, produce a redundancy gain when the dimensions are correlated 
and some measure of speed of classification is used, and produce interference in 
speed of classification when selective attention is required with orthogonal stim- 
ulus dimensions" (p. 238). Added to this list of converging operations are the 
recent data of Handel and Imai (1972) which distinguish between integral and 
separable dimensions using free classification tasks. Since the present experi- 
ment employed two of the converging operations listed above, conditions directly 
analogous to those of Garner and Felfoldy (1970), the present results may be com- 
pared to those expected of integral and separable dimensions. 

The results for the Place dimension correspond exactly to those of one 
member of an integral pair of dimensions: a significant redundancy gain in the 
correlated condition and significant interference in the orthogonal condition. 
However, the results for the Pitch dimension showed a different pattern, one 
which is consistent with neither integral nor separable dimensions. In this case 
there was again a significant redundancy gain in the correlated condition, but 
there was minimal interference in the orthogonal condition. 

These results pose two problems for a strict distinction between integral 
and separable dimensions. First, the interference between Place and Pitch in 
the orthogonal conditions was unidirectional or asymmetric. Previously the in- 
tegral-separable distinction has always been considered to be symmetric, with 
either equal interference or no interference between a pair of dimensions. How- 
ever, as discussed in detail by Garner (1973), recent data suggest that integral- 
ity and separability may be more accurately considered as the two ends of a con- 
tinuum rather than as a strict dichotomy. The second problem for the integral- 
separable distinction posed by the present results is that the processing of the 
Place and Pitch dimensions was affected differentially by the correlated and 
orthogonal conditions. That is, instead of strictly obeying the converging oper- 
ations notion of integrality and separability presented above, Place and Pitch 
produced one pattern of results in the correlated conditions and another pattern 
in the orthogonal conditions. These results indicate that subjects appear to 
have some option in the way they process Place and Pitch: a) they can process 
the two dimensions in parallel, as demonstrated by the redundancy gain for both 
dimensions in the correlated conditions; or b) they can process the Pitch dimen- 
sion selectively, as demonstrated by the minimal interference produced by irrele- 
vant variation in Place in the orthogonal condition. 

The processing options for Place and Pitch also pose problems for existing 
process models. A number of authors have attempted to develop a single serial 
or parallel model which could account for a wide variety of information-process- 
ing tasks (see discussions by Egeth, 1966; Smith, 1968; Hawkins, 1969; Sternberg, 
1969; Biederman and Checkosky, 1970; Grill, 1971; Nickerson, 1971; Townsend, 
1971; Saraga and Shallice, 1973). Basic serial and parallel models have been 
modified to include distinctions between exhaustive and self-terminating pro- 
cesses, fixed or random orders of search, fixed or variable durations for each 
component process, and a number of others. In addition, Lockhead (1972) pre- 
sented a "blob" or holistic processing model in which a multidimensional stimu- 
lus is first processed holistically, with subsequent serial processes as required 
by the task. The present results are inconsistent with any model that specifies 
a particular mandatory process — serial, parallel, or holistic. These data pro- 
vide further evidence for the suggestions made by Grill (1971), Nickerson (1971), 
Townsend (1971), and Garner (in press) that no single stimulus distinction or 
process model may be universally appropriate. Rather, under different conditions 
subjects may employ various strategies, including serial processing, parallel 
processing, holistic processing, and combinations thereof, depending upon the 
constraints of the stimuli and perceptual tasks involved. Instead of performing 
"critical experiments" to establish the validity of various models, a more appro- 
priate strategy would appear to be the empirical distinction between mandatory 
and optional processes, and the investigation of stimulus properties and task 
conditions related to each (Grill, 1971; Nickerson, 1971; Townsend, 1971; Garner, 
in press). 

Relation to Models of Speech Perception 

The idea that speech perception may involve some form of parallel processing 
has been suggested on logical grounds by a number of investigators (cf. Liberman, 
Cooper, Shankweiler, and Studdert-Kennedy, 1967; Stevens and House, 1972; 
Studdert-Kennedy, in press). The results of the present experiment provide clear 
evidence that auditory and phonetic information can be processed in parallel, and 
they provide a starting point for the investigation of possible parallel process- 
ing of other kinds of linguistic and nonlinguistic information in speech percep- 
tion. 

However, despite the strong evidence in the present experiment for parallel 
processing of auditory and phonetic dimensions, such a conclusion contradicts the 
intuitively reasonable idea that auditory and linguistic processes are organized 
serially, with linguistic processes dependent upon those performed by the general 
auditory system. This idea is clearly stated by Stevens and House (1972): "All 
acoustic signals undergo some common peripheral processing, and up to a certain 
point in the auditory system the nature of this early processing is the same 
whether the signal is speech or is not speech" (p. 9). The suggestion that lin- 
guistic processes are dependent upon the peripheral auditory system must certainly 
be correct in some form, since all acoustic signals must be transduced by the 
receptor apparatus. Thus, the "common peripheral processing" for speech and non- 
speech undoubtedly includes the receptor apparatus, and presumably includes much 
of the subcortical auditory system. However, the exact extent to which this 
"common peripheral processing" extends anatomically into the auditory system re- 
mains to be determined, and the locus at which processing peculiar to speech is 
initiated continues to be one of the most interesting questions in speech percep- 
tion. 

The general organization of auditory and phonetic processes that seems 
necessary to account for these observations would require at least three compo- 
nents: a) a common peripheral component for the transduction and preliminary 
analysis of all acoustic signals, b) a "central" auditory component for the 
additional processing of nonlinguistic auditory information, and c) a "central" 
phonetic component for the extraction of phonetic features from the results of 
the preliminary auditory analysis. The two central components would be capable 
of functioning in parallel, but both would be dependent upon the output of the 
prior peripheral processing. From this point of view, the common peripheral 
stage could not be directly manipulated in RT experiments, since it would be 
mandatory for all auditory processing tasks. As a working hypothesis this 
"hybrid" organization would be consistent with the parallel processing of audi- 
tory and phonetic information demonstrated by the present experiment, and with 
the previous experiments which suggest that the perception of speech involves 
specialized neural mechanisms not required for the perception of nonspeech 
(cf. Liberman et al., 1967; Studdert-Kennedy and Shankweiler, 1970; Mattingly, 
Liberman, Syrdal, and Halwes, 1971; Wood, Goff, and Day, 1971; Studdert-Kennedy 
et al., 1972; Wood, 1973). 

Perception of Speech and Nonspeech, with Speech-Relevant and Speech-Irrelevant 
Transitions 

James E. Cutting+ 

Haskins Laboratories, New Haven, Conn. 



The course of speech processing, from registration of the acoustic 
signal to phonetic interpretation of that signal, may be represented 
as a number of operations engaging different mechanisms. Since speech 
perception is primarily confined to the left hemisphere of the brain, 
many of these operations may be assumed to take place there. The re- 
sults of the present study, together with results of previous studies, 
indicate that along with a general speech processor in the left hemi- 
sphere there is a purely auditory device suited to analyze transitions, 
whether those transitions are part of a speech signal or not. 

Differences in the perception of speech and nonspeech have been well docu- 
mented. Perhaps the best information about the processing of these two types of 
stimuli stems from dichotic listening. In general, speech tasks, whether identi- 
fication or temporal-order judgment, yield right-ear advantages (Kimura, 1961; 
Day and Cutting, 1971). On the other hand, nonspeech tasks, whether identification 
or temporal-order judgment, generally yield left-ear advantages (Chaney and 
Webster, 1966; Day and Cutting, 1971). The results of a recent study (Cutting, 
1973) are in accord with these findings. In a temporal-order judgment task, 
speech stimuli yielded a large right-ear advantage and nonspeech stimuli yielded 
a small advantage to the left ear. Although the ear advantage for nonspeech 
stimuli was not significant, it differed significantly from that for the speech 
stimuli. The nonspeech stimuli, however, were quite different from those used in 
many studies. They were sine-wave analogs to the speech stimuli: that is, three 
pure tones were synthesized to correspond to the middle frequency of each of the 
three formants of each of the synthetic speech stimuli. As in previous studies, 
processing appeared to be different for the two types of stimuli, each type of 
stimulus requiring different general amounts of processing in each hemisphere, 
even though the stimuli shared many acoustic characteristics. 

Since speech/nonspeech differences have been given so much attention in the 
past (Kimura, 1967; Semmes, 1968; Mattingly, Liberman, Syrdal, and Halwes, 1971; 
Day, in press), consider the other, and perhaps more interesting, general result 
of the Cutting study. Stimuli with transitions were perceived differently than 
stimuli without transitions regardless of whether they were speech or nonspeech: 
a stimulus which contained rapid frequency transitions required special process- 
ing in the left hemisphere, and this processing was not dependent on the stimulus 
having been classified as speech. 



+Also Yale University, New Haven, Conn. 

[HASKINS LABORATORIES: Status Report on Speech Research SR-35/36 (1973)] 




The processing capabilities of the left hemisphere appear to be fundamen- 
tally different from those of the right hemisphere. Most of these differences 
have been thought to be related to language, and range through all levels from 
phonetics to semantics. Since speech and language processing is primarily con- 
fined to one hemisphere, it would be advantageous to have certain subsidiary sys- 
tems in that hemisphere to assist in the demodulation of the incoming speech sig- 
nal. A subsystem which tracks rapidly changing formant frequencies would not 
typically be needed in a hemisphere geared for nonlinguistic analysis since such 
changing frequencies often do not appear to be pertinent to the perception of 
nonspeech sounds. The notion of an auditory analyzer residing only in the left 
hemisphere is in keeping with Semmes' (1968) views of hemisphere differences in 
processing capabilities. She states that left-hemisphere function is character- 
ized by "focal" organization, whereas the right hemisphere is characterized by a 
more "diffuse" organization. Certainly the analysis of rapid pitch modulations 
is a "focal" task requiring very precise detectors. 

It seems reasonable to suppose that the analysis of transitions is indepen- 
dent of whether the stimulus is speech or nonspeech. Otherwise, the system must 
necessarily make an early decision determining whether or not a stimulus is 
speech, before starting to analyze its transitions. Such a process would be un- 
necessarily cumbersome, if not untenable. Moreover, the notion of an independent 
transition analyzer in the left hemisphere is congruent with the results of 
Darwin (1971). He found that fricatives with formant transitions yielded a 
right-ear advantage, while the same fricatives without transitions yielded no 
advantage. 

Cutting (1973) employed two variables in his stimuli: they could be speech 
or nonspeech, with transitions or without transitions. Speech processing is pri- 
marily a left-hemisphere task, and it appears that the analysis of transitions is 
also a left-hemisphere task. The two processes appear to be independent of one 
another, and also additive. Consonant-vowel (CV) stimuli, both [+ speech] and 
[+ transition], yielded a large right-ear/left-hemisphere advantage since the two 
variables favor left-hemisphere processing. Steady-state vowels (V) and sine- 
wave analogs of the CV stimuli (CVsw), however, had only one positive value on 
the two dimensions and thus yielded smaller ear difference scores: V stimuli 
were [+ speech] and [- transition] and CVsw stimuli were [- speech] and [+ transi- 
tion]. Sine-wave analogs of the vowel stimuli (Vsw) were both [- speech] and 
[- transition], and consequently yielded a left-ear advantage. It should be 
noted that speech/nonspeech was a more potent dimension than transition/nontransi- 
tion: V stimuli, for example, yielded a larger right-ear advantage than did CVsw 
stimuli. 

These results, however, do not necessarily support the notion of an auditory 
transition analyzer in the left hemisphere. It remains possible that this second- 
ary mechanism is, in reality, a language-based subsystem. Perhaps, because of 
their resemblance to speech syllables, Cutting's CVsw stimuli triggered this pro- 
cessor into analyzing the transitions appropriate for the stop consonants. The 
present study was designed to test whether stimuli with transitions not corres- 
ponding to phoneme segments yield ear advantages equal to those of stimuli that 
have transitions corresponding to phoneme segments. An affirmative result would 
support the notion that this device is an auditory feature detector, while a 
negative result would indicate that it is a linguistic feature detector. 




Method 



Stimuli. CV and CVsw stimuli were used again as in Cutting (1973). They 
were, or corresponded to, the CV syllables [bi, gi, bae, gae, bo, go]. Stimuli 
containing the same vowel, or sine waves which corresponded to the same vowel 
(such as [bae, gae]), were identical in all respects except for the second-formant 
transition. In all cases, the second-formant transition of [b] rose for 50 msec 
to the resting frequency of the formant, while the transition of [g] fell to that 
frequency. First- and third-formant transitions were both always upgliding. Two 
other sets of stimuli were synthesized. One set was similar to the CV stimuli in 
that its members contained formants and formant transitions typically found in 
speech. The particular array of formant transitions, however, could never have 
been produced by a human vocal tract. As in the CV stimuli, the second-formant 
transition could be either upgliding or downgliding in frequency. The first- and 
the third-formant transitions, on the other hand, were always downgliding. The 
extent of the two transitions was the same; only the direction was changed. 
Since these stimuli had transitions which did not correspond to any consonant 
speech segment they are designated C*V stimuli. The fourth stimulus set con- 
sisted of sine-wave analogs of the C*V stimuli, and its members are designated 
C*Vsw stimuli. 
Four stimuli, one from each class, are displayed in Figure 1. Thick bars indi- 
cate formants in the speech and speech-like stimuli, while narrow lines indicate 
sine waves in the nonspeech stimuli. The actual bandwidth of the formants was 
60, 90, and 120 Hz for the first, second, and third formants, respectively. The 
bandwidth of the sine waves was essentially zero. 
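
The transition patterns described above can be sketched in code. What follows is 
only an illustrative sketch, not the synthesis procedure actually used: the 
resting frequencies, transition extents, 300-msec duration, and 10-kHz sampling 
rate are hypothetical values, and every function name is invented here. Only the 
50-msec transition duration and the glide directions are taken from the text. 
The sketch builds one frequency track per formant and then sums one pure tone 
per track, which is the sense in which a sine-wave analog follows the center 
frequencies of the formants. 

```python
import math

TRANSITION_MS = 50      # transition duration stated in the text
DURATION_MS = 300       # hypothetical total stimulus duration
SAMPLE_RATE = 10000     # hypothetical sampling rate in Hz

# Hypothetical resting frequencies and transition extents (Hz) for one vowel.
RESTING_HZ = {"F1": 700.0, "F2": 1200.0, "F3": 2600.0}
EXTENT_HZ = {"F1": 300.0, "F2": 300.0, "F3": 300.0}


def glide_track(resting_hz, extent_hz, direction):
    """Return one frequency value per msec.

    direction = +1: upgliding (starts below the resting frequency and rises).
    direction = -1: downgliding (starts above the resting frequency and falls).
    """
    start_hz = resting_hz - direction * extent_hz
    track = []
    for t_ms in range(DURATION_MS):
        if t_ms < TRANSITION_MS:
            frac = t_ms / TRANSITION_MS
            track.append(start_hz + (resting_hz - start_hz) * frac)
        else:
            track.append(resting_hz)
    return track


def stimulus_tracks(consonant, phonetic=True):
    """Frequency tracks for F1, F2, and F3.

    consonant 'b' gives a rising F2 transition, 'g' a falling one.
    phonetic=True gives the CV pattern (F1 and F3 upgliding);
    phonetic=False gives the C*V pattern (F1 and F3 downgliding).
    """
    f2_dir = 1 if consonant == "b" else -1
    f1f3_dir = 1 if phonetic else -1
    dirs = {"F1": f1f3_dir, "F2": f2_dir, "F3": f1f3_dir}
    return {name: glide_track(RESTING_HZ[name], EXTENT_HZ[name], dirs[name])
            for name in ("F1", "F2", "F3")}


def sine_wave_analog(tracks):
    """Sum one pure tone per track, accumulating phase sample by sample."""
    samples_per_ms = SAMPLE_RATE // 1000
    phases = {name: 0.0 for name in tracks}
    samples = []
    for t_ms in range(DURATION_MS):
        for _ in range(samples_per_ms):
            total = 0.0
            for name, track in tracks.items():
                phases[name] += 2.0 * math.pi * track[t_ms] / SAMPLE_RATE
                total += math.sin(phases[name])
            samples.append(total / len(tracks))
    return samples
```

Under these assumptions, `sine_wave_analog(stimulus_tracks("b"))` corresponds to 
a CVsw-like signal and `sine_wave_analog(stimulus_tracks("g", phonetic=False))` 
to a C*Vsw-like signal. The formant versions would require a formant 
synthesizer, which is not attempted here. 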

Subjects. Ten Yale University undergraduates participated in Task 1, and 
16 others participated in Task 2. All were right-handed native American English 
speakers with no history of hearing difficulty. None had previous experience 
listening to synthetic speech or to dichotic stimuli. 

Task 1: Diotic Identification 

A brief identification task was run to be sure that the sine-wave stimuli 
were not identifiable as speech. 

Procedure. Only CV and CVsw stimuli were selected to be identified. Sub- 
jects (Ss) listened to a tape of 120 items presented in random order one at a 
time: (2 classes of stimuli) x (6 stimuli per class) x (10 observations per 
stimulus). They were instructed to write down their responses as BEE, DEE, GEE, 
BAA, DAA, GAA, BAW, DAW, or GAW for each item. In this manner Ss were forced to 
try labeling the sine-wave stimuli as speech. Note that there were, in fact, no 
stimuli which began with the phoneme [d]. No practice or training was given. 

Results. CV stimuli were correctly identified on 82 percent of all trials, 
while CVsw stimuli were correctly identified on only 19 percent of all trials, 
only a few percentage points above chance. Responses were parsed into their com- 
ponent segments, and performance was measured for both consonants and vowels. In 
the CV stimuli, consonants were correctly identified 82 percent of the time and 
vowels 97 percent of the time. The corresponding scores for the CVsw stimuli 
were 38 and 45 percent correct. 
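
For reference, the guessing baselines implied by the response set can be worked 
out directly. The short sketch below assumes that a guessing S chooses uniformly 
among the nine permitted responses; that assumption, and the names used, are 
introduced here for illustration and are not part of the original analysis. 

```python
# Response alternatives offered to Ss in Task 1.
CONSONANTS = ["B", "D", "G"]
VOWELS = ["EE", "AA", "AW"]
RESPONSES = [c + v for c in CONSONANTS for v in VOWELS]


def chance_percent(n_alternatives):
    """Percent correct expected from uniform guessing among n alternatives."""
    return 100.0 / n_alternatives


# Items on the tape: classes x stimuli per class x observations per stimulus.
N_ITEMS = 2 * 6 * 10

whole_syllable_chance = chance_percent(len(RESPONSES))   # about 11.1 percent
consonant_chance = chance_percent(len(CONSONANTS))       # about 33.3 percent
vowel_chance = chance_percent(len(VOWELS))               # about 33.3 percent
```

These baselines are the values against which the CVsw scores of 19, 38, and 45 
percent can be compared. 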

These results strongly suggest that the formant stimuli were processed as 
speech and the sine-wave stimuli as nonspeech. Although Ss gave consistently 
correct identifications for the CV stimuli, they scattered their identifications 
over the nine possible responses for the CVsw stimuli. 

Task 2: Dichotic Temporal-Order Judgment 



A dichotic listening task was devised to assess the role of transitions in 
the perception of speech and nonspeech signals. The paradigm was a temporal- 
order judgment task not requiring Ss to give verbal labels to the stimuli, and 
was identical to that of Cutting (1973). 

Tapes and procedure. Trials for the dichotic temporal-order judgment task 
were constructed so that Ss recognized the leading stimulus in a given dichotic 
pair by means of a subsequent probe stimulus. A trial consisted of a dichotic 
pair with a temporal onset asynchrony of 50 msec, followed by 1 sec of silence, 
followed by a diotic stimulus which was one of the members of the dichotic pair. 
Ss were instructed to regard the diotic stimulus as a probe which asked the 
question: "Is this the stimulus which began first?" Figure 2 shows a schematic 
representation of two such trials. Consider sample trial 1, where Stimulus 1 
begins before Stimulus 2 by 50 msec, and the probe stimulus is Stimulus 1. Since 
the probe is identical to the stimulus that began first, the correct response is 
yes. In sample trial 2, the dichotic pair is the same as in trial 1, but the 
probe stimulus is different. Since Stimulus 2 did not begin before Stimulus 1, 
the correct response for trial 2 is no. The 50 msec onset asynchrony was chosen 
because it is a particularly difficult interval at which to judge stimulus order 
(see Day, Cutting, and Copeland, 1971; Day and Cutting, 1971; Day and Vigorito, 
1973). 

Four tapes were constructed, one for each class of stimuli. Each tape con- 
sisted of 48 trials: (6 possible pairs) x (2 channel arrangements) x (2 possi- 
ble probes) x (2 observations per pair). The stimuli used in each trial were 
always selected from the same class of stimuli. CV trials, for example, were 
constructed out of CV stimuli that shared neither the same vowel nor the same 
consonant: thus, for example, /bi/ was paired with /gae/ or /go/. CVsw, C*V, 
and C*Vsw trials were constructed using the same rules applied to CV stimuli. 
Stimuli in the dichotic pair were counterbalanced for leading and lagging posi- 
tion. The probe stimulus chosen for each trial and the channel assignments of 
the stimuli in the dichotic pair were also counterbalanced in the random sequence 
of trials. 

Ss listened to each tape twice, reversing the earphones after one pass 
through the tape. The order of channel assignments was counterbalanced across 
Ss. Each group listened to the four tapes in a different order, determined by a 
balanced Latin square design. Ss listened to a total of 384 trials, each 
consisting of a dichotic pair and a diotic probe, writing Y for yes or N for no 
for each trial. Four practice trials were given before each stimulus class in 
order to familiarize Ss with the stimuli. 
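
The trial counts follow from the pairing rule and the counterbalancing scheme. 
As a check, the sketch below enumerates the pairs of CV stimuli that share 
neither consonant nor vowel and multiplies out the design. The data layout and 
names are illustrative choices made here, not taken from the original tape 
construction. 

```python
from itertools import combinations

# The six CV syllables, written as (consonant, vowel).
STIMULI = [("b", "i"), ("g", "i"), ("b", "ae"),
           ("g", "ae"), ("b", "o"), ("g", "o")]


def legal_pairs(stimuli):
    """Unordered pairs that share neither the consonant nor the vowel."""
    return [(s1, s2) for s1, s2 in combinations(stimuli, 2)
            if s1[0] != s2[0] and s1[1] != s2[1]]


pairs = legal_pairs(STIMULI)

# pairs x channel arrangements x possible probes x observations per pair
trials_per_tape = len(pairs) * 2 * 2 * 2

# four stimulus classes, each tape heard twice (earphones reversed)
total_trials = trials_per_tape * 4 * 2
```

The enumeration yields six pairs, 48 trials per tape, and 384 trials in all, in 
agreement with the figures given in the text. 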

Results 

In general the task was quite difficult: overall performance for all trials 
and all types of stimuli was 60 percent correct. Performance for each of the 
four types of stimuli was comparable: the average score for each was between 59 
and 61 percent, with no significant differences among them. Because of the com- 
parability of performance levels no phi coefficient analysis was necessary (see 
Kuhn, in press). 






The pattern of ear advantages is quite interesting, but it is necessary 
first to note how the results were scored. Consider again the sample trials in 
Figure 2. The correct response for sample trial 1 is yes, while the correct re- 
sponse for trial 2 is no. If in the dichotic pair Stimulus 1 was presented to 
the right ear and Stimulus 2 to the left ear, and if the S responded yes for the 
first trial and no for the second, he would have been correct on both. This 
would be scored as two correct responses for the right-ear leading stimulus. On 
the other hand, if the S had responded no and yes, respectively, for the sample 
trials, both would be wrong and his score for the right-ear leading stimulus 
would be docked for two incorrect responses. (Of course, if the channels had 
been reversed, with Stimulus 1 presented to the left ear and Stimulus 2 to the 
right, the logic would be entirely reversed.) 
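
The scoring rule can be restated as a small procedure. The sketch below is one 
possible rendering under assumptions made here: each trial is recorded as the 
ear receiving the leading stimulus, whether the probe matched the leading 
stimulus, and the written response (Y or N). The record format and function 
names are hypothetical. 

```python
from collections import Counter


def expected_response(probe_is_leader):
    """Correct answer: Y if the probe is the stimulus that began first."""
    return "Y" if probe_is_leader else "N"


def percent_correct_by_leading_ear(trials):
    """trials: iterable of (leading_ear, probe_is_leader, response)."""
    correct = Counter()
    total = Counter()
    for leading_ear, probe_is_leader, response in trials:
        total[leading_ear] += 1
        if response == expected_response(probe_is_leader):
            correct[leading_ear] += 1
    return {ear: 100.0 * correct[ear] / total[ear] for ear in total}


def right_ear_advantage(scores):
    """Difference in percent correct, right-ear lead minus left-ear lead."""
    return scores["right"] - scores["left"]


# The two sample trials, with Stimulus 1 leading in the right ear.
sample_trials = [
    ("right", True, "Y"),    # trial 1: probe is the leading stimulus
    ("right", False, "N"),   # trial 2: probe is the lagging stimulus
]
sample_scores = percent_correct_by_leading_ear(sample_trials)
```

Both sample responses are credited to the right-ear leading condition. Had the 
responses been N and Y instead, the same condition would be charged with two 
errors. 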

CV and C*V trials. There was a large, significant ear difference for the 
formant stimuli. Ss were 65 percent correct in responding to the probe stimulus 
when the leading stimulus was presented to the right ear, and 54 percent correct 
when it was presented to the left ear, yielding a net 11 percent difference 
[F(1,15) = 5.38, p < .05]. There was no significant difference between CV and 
C*V stimuli. 

CVsw and C*Vsw trials. No significant ear advantage was found for sine- 
wave stimuli. Ss were 61 and 60 percent correct for right-ear and left-ear lead- 
ing trials, respectively, a net 1 percent right-ear advantage. 

The results of each of the four conditions are shown in Figure 3. The 
effect of phonetic versus nonphonetic transitions was not an important factor in 
either formant or sine-wave stimulus perception. 

Speech vs. nonspeech. Although the C*V stimuli did not have phonetic transi- 
tions, their vowel portions were speech-like enough to make the stimuli sound 
like syllables with a garbled onset, good enough to rate the label "speech." 
Sine-wave stimuli, regardless of the nature of their transitions, always sounded 
like nonspeech. Accepting this redefinition of the speech/nonspeech distinction, 
there was a significant difference in the ear advantages for the two types of 
stimuli [F(1,15) = 16.41, p < .001]. Moreover, the magnitude of the speech/non- 
speech difference in the present study was comparable to that found in Cutting's 
(1973) previous study using this paradigm. 

Discussion 

The results of the present study suggest that the transition analyzer in the 
left hemisphere is a language-independent device. Stimuli with transitions 
which do not correspond to any phoneme segment yield results nearly identical to 
those for stimuli which have phonetically appropriate transitions. This result 
is consonant with that of Kimura and Folb (1968), who found a right-ear advantage 
for the perception of speech played backwards. Backwards speech, like C*V stimuli, 
has transitions that are often inappropriate for the perception of any specific 
speech segments, but it is nevertheless heard and processed as speech. 

The existence of a purely auditory analyzer in the left hemisphere, in addi- 
tion to the usual speech processor, may be able to explain the variation in ear 
advantages found within specific phonemes. Indeed, Studdert-Kennedy and 
Shankweiler (1970) reported differential ear advantages for various stop conso- 
nants: although the differences were not significant, [b, g, p, k] yielded 
larger right-ear advantages than [d, t] in many different vowel contexts. In 
general, the labial [b, p] and the velar [g, k] stops have second-formant transi- 
tions of greater extent than do the alveolar stops [d, t]. Perhaps these sys- 
tematic differences in ear advantages stem from differential engagements of the 
auditory analyzer. A possible effect of transitions can be seen not only in the 
consonants but also in the vowels. Studdert-Kennedy and Shankweiler reported 
differential ear advantages for the several vowels that they used. Not surpris- 
ingly, [i], the vowel which has the most extreme second-formant position and 
which typically undergoes much context-conditioned variation in CVC syllables, 
yielded the largest right-ear advantage. Furthermore, it had the only ear advan- 
tage for a vowel that was significant. 

Increased right-ear advantages appear to result from increased acoustic 
variation in both nonspeech and speech contexts. Apart from the results of the 
present study, Halperin, Nachshon, and Carmon (1973) found that the introduction 
of acoustic variation within nonspeech sounds could change a left-ear advantage 
into a right-ear advantage. Furthermore, the more the acoustic variation, the 
greater the tendency for the right-ear/left-hemisphere system to excel over its 
counterpart. 

The results of the present study, taken in conjunction with those of Cutting 
(1973), Darwin (1971), and Halperin et al. (1973), provide strong evidence that 
the left hemisphere is not only specialized for processing speech, but is spe- 
cialized for processing certain purely auditory events as well. It seems ecolog- 
ically parsimonious for man to have developed both systems within the same hemi- 
sphere of the brain. Certainly most speech utterances are characterized by a 
myriad of formant transitions which must necessarily be analyzed before phonetic 
decisions can be made about them. That the appropriate auditory analyses and the 
phonetic decisions about them appear to occur in relatively close proximity is an 
example of an elegant way to build a system. 

On the Identification of Place and Voicing Features in Synthetic Stop Consonants 

David B. Pisoni and James R. Sawusch 



Two models of the interaction of phonetic features in speech per- 
ception were used to predict subjects' identification functions for a 
bidimensional series of synthetic consonant-vowel syllables. The 
stimuli varied systematically in terms of the acoustic cues underlying 
the phonetic features of place of articulation and voicing. Model I 
assumed that phonetic features are additive and are processed inde- 
pendently in perception. Model II assumed that the phonetic features 
interact and are not processed independently. The fit of Model II to 
the bidimensional series data was better than the fit of Model I, sug- 
gesting that the phonetic features of place and voicing in stop conso- 
nants are not processed independently but rather show a mutual depen- 
dency. 

Theoretical accounts of speech sound perception have frequently proposed 
some type of articulatory-motor involvement during perceptual processing 
(Liberman, 1957; Stevens, 1960; Liberman, Cooper, Shankweiler, and Studdert- 
Kennedy, 1967; Stevens and Halle, 1967). One reason for this may be that re- 
search on speech sound perception has drawn its descriptive categories from the 
account of speech production offered by phoneticians. Thus, the articulatory 
dimensions that distinguished different classes of speech sounds in production 
served as the basis for uncovering the acoustic cues that distinguish different 
speech sounds in perception. Spectrographic analysis and perceptual experiments 
revealed that the sounds of speech were not arrayed along a single complex di- 
mension but could be specified in terms of a few simple and independent dimen- 
sions (Gerstman, 1957; Liberman, 1957). Acoustic dimensions were found in early 
experiments with synthetic speech to provide distinctions in perception corre- 
sponding to the articulatory dimensions of speech production, suggesting that 
perceptual and articulatory dimensions of speech may be intimately linked 
(Delattre, 1951; Liberman, 1957). 

Two articulatory features to receive a great deal of attention in the de- 
scription of stop consonant production are place of articulation and voicing. 
Both these features have fairly well defined acoustic properties which presumably 
mirror the differences in production (Delattre, 1951). For example, the feature 
of place of production refers to the point of constriction in the vocal tract 
where closure occurs. The acoustic cues that underlie the place feature in con- 
sonant-vowel (CV) syllables are reflected in the formant transitions into the 
following vowel, particularly the direction and extent of the second and third 
formant transitions (Liberman et al., 1967). In contrast, the voicing feature is 
related to the presence or absence of periodic vibration of the vocal cords. The 
acoustic cues that underlie the voicing feature in stop consonants in initial 
position are reflected in terms of the relative onset of the first formant transi- 
tion (i.e., F1 "cutback") and the presence of aspiration in the higher formants 
(Liberman, Delattre, and Cooper, 1958). This compound acoustic cue has been 
called "voice-onset time" (VOT) by Lisker and Abramson (1964) and corresponds to 
the time interval between the release from stop closure and the onset of laryn- 
geal pulsing. 

Figure 1 presents schematized spectrographic patterns which show the acous- 
tic cues for place and voicing features for the CV syllables /ba/, /da/, /pa/, 
and /ta/. There is a relatively simple relation between articulatory features of 
place and voicing and their respective acoustic cues when the vowel is held con- 
stant (see also Liberman, 1970). Consonants within a particular row share voic- 
ing; /ba/ and /da/ are voiced, /pa/ and /ta/ are voiceless. The major acoustic 
cue for voicing in these syllables is the cutback or elimination of the initial 
portion of the first formant. Consonants within a particular column share place 
of production: /ba/ and /pa/ are bilabial stops, /da/ and /ta/ are alveolar 
stops. The primary acoustic cue for place is the direction and extent of the 
second and third formant transitions. 

Several perceptual experiments employing stop consonant-vowel syllables have 
concluded that the features of place and voicing are processed independently of 
each other. For example, Miller and Nicely (1955) analyzed the perceptual confu- 
sions among 16 CV syllables presented to listeners under various signal-to-noise 
ratios and filtering conditions. They computed the sum of the information trans- 
mitted by the features separately and in combination. Since the two values were 
approximately equal, they concluded that the features used in their analysis were 
mutually independent. Among these features were place and voicing. As a part of 
a larger investigation of dichotic listening, Studdert-Kennedy and Shankweiler 
(1970) reached the same conclusion by a similar analysis of place and voicing 
confusions among stop consonants. 

These studies imply that features are extracted separately during early per- 
ceptual processing and are later recombined in response. Figure 2 represents a 
simplified block diagram of this process. The output of the auditory analysis is 
a set of acoustic cues {c_i}. These cues are combined and from them a set of pho- 
netic features {f_j} is recognized. Finally, the phonetic features are combined 
to yield the perception of the phonetic segment. Together, stages 2 and 3 form 
what Studdert-Kennedy (1973) has described as the "phonetic" stage of processing. 
We assume that phonetic features are recognized or identified in short-term mem- 
ory (STM) when the auditory patterns derived from the acoustic cues have made con- 
tact with some representation generated from synthesis rules residing in long-term 
memory. We assume that abstract phonetic features have an articulatory rather 
than acoustic reality in STM although we will not try to justify this assumption 
at present. 

Figure 1: A schematized sound spectrogram of the syllables /ba/, /da/, /pa/, 
and /ta/ as used in the present experiment. 

To say that phonetic features are independent implies independence of pro- 
cessing at all three stages of Figure 2. That is, the acoustic cues are extracted 
from the acoustic waveform separately and independently of each other in stage 1. 
Then the phonetic features are extracted separately and independently from the 
acoustic cues in stage 2. Finally, the phonetic features are combined separately 
and independently of each other in stage 3, resulting in a particular phonetic 
segment. 

The independence of phonetic feature processing in stage 3 may be described 
quantitatively by a simple linear or additive model. The phonetic features of 
place (f_1) and voicing (f_2) that are output from stage 2 are weighted separately 
and then added together in stage 3. Equation 1 expresses this concept algebra- 
ically: 

    X = a_1 f_1 + a_2 f_2    (1) 

Here, f_1 is the amount of the place feature of stimulus X output from stage 
2, and a_1 is its associated weight. Similarly, f_2 is the amount of the voicing 
feature of X output from stage 2, and a_2 is its associated weight. Since these 
two features are sufficient to distinguish among the four stops b, d, p, and t, 
we will ignore other phonetic features as being redundant and nondistinctive. 

However, evidence for nonindependence of phonetic features, in particular 
the features of place and voicing, has also been presented by several investiga- 
tors. This nonindependence could come at any of the three levels mentioned. For 
example, Haggard (1970) put the dependence relationship of the features of place 
and voicing in the second stage, where phonetic features are extracted. In the 
model of Haggard (1970), the listener's decision on the voicing feature is partly 
determined by his prior decision on the place feature. 



Lisker and Abramson (1964, 1967) reached the corresponding conclusion upon 
examination of their production data for stop consonants in initial position. 
The voicing feature as reflected in VOT depends on the feature of place of pro- 
duction; the VOT lag at the boundary between voiced and voiceless stops increases 
as place of production moves further back in the vocal tract (i.e., from /ba/ to 
/da/ to /ga/). Given the anatomical and physiological constraints on speech pro- 
duction, this position is a priori more plausible. 

This particular concept of nonindependence, where the feature of voicing 
partly depends on the feature of place, may also be expressed algebraically. We 
will again assume independence of processing in stages 1 and 2. The nonindepen- 
dence in stage 3 may be expressed as: 



    X = a_1 f_1 + a_2 f_2 + b(1 - f_1) f_2    (2) 

Here, a_1, f_1, a_2, and f_2 are the same as in equation 1. However, the con- 
stant b represents the weight given to the interaction term of place and voicing 
[(1 - f_1) f_2]. 
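
To make the difference between the two combination rules concrete, the sketch 
below evaluates equations 1 and 2 for given feature values. The function names 
and the numerical weights in the comments are illustrative assumptions and are 
not estimates from any data. 

```python
def additive_combination(f1, f2, a1, a2):
    """Equation 1: independently weighted sum of place (f1) and voicing (f2)."""
    return a1 * f1 + a2 * f2


def interactive_combination(f1, f2, a1, a2, b):
    """Equation 2: the additive rule plus the interaction term b(1 - f1)f2."""
    return a1 * f1 + a2 * f2 + b * (1.0 - f1) * f2


# With hypothetical weights a1 = 0.4, a2 = 0.6, b = 0.3:
# the interaction term vanishes when f1 = 1 (or when f2 = 0) ...
same = interactive_combination(1.0, 0.7, 0.4, 0.6, 0.3)
# ... and contributes its full weight b when f1 = 0 and f2 = 1.
shifted = interactive_combination(0.0, 1.0, 0.4, 0.6, 0.3)
```

The interaction term therefore alters the output only for stimuli in which the 
voicing feature is present and the place feature is less than fully present, 
which is how equation 2 lets the contribution of voicing depend on place. 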

The purpose of the present experiment was to reexamine the identification of 
place and voicing features in stop consonants and to determine by means of a new 
experimental paradigm whether these two phonetic features (i.e., place and voic- 
ing) are combined additively or nonadditively in stage 3 as shown in Figure 2. 
The stimuli for the experiment were three sets of synthetic speech sounds that 
varied systematically in the acoustic cues underlying the two phonetic features 
of place and voicing. One series of stimuli varied in the acoustic cues that 
underlie the phonetic feature of place, while holding the voicing feature 
constant (/ba/ to /da/ with VOT at 0 msec). A second series varied the acoustic 
cues underlying the phonetic feature of voicing, while holding the place feature 
constant (/ba/ to /pa/ with F2 and F3 always rising). The final series varied the 
acoustic cues underlying both phonetic features simultaneously (/ba/ to /ta/). 
These three sets of speech sounds were presented separately to listeners for 
identification into the categories /ba/-/da/, /ba/-/pa/, and /ba/-/ta/, 
respectively. The use of synthetically produced stimuli made it possible to 
control experimentally the correlation between place of production and voicing 
that Lisker and Abramson (1964, 1967) had found in natural speech. 

Our principal aim was to determine whether the probabilities of identifica- 
tion along the bidimensional continuum (/ba/ to /ta/) could be predicted from 
some combination of the probabilities along the separate unidimensional series. 
We consider below two possible models of ways these separate features might be 
combined in the bidimensional case. Both Model I and Model II are concerned with 
the manner in which phonetic features are combined in phonetic perception. All 
processing up to stage 3 of Figure 2 is assumed to be independent according to 
the definition of independence for these stages given previously. We also assume 
that processing in stages 1 and 2 takes place in parallel and is automatic in the 
sense that Ss do not have control over these stages of perceptual processing. 
(See also Shiffrin and Geisler, 1973, and Shiffrin, Pisoni, and Castenada-Mendez, 
in press.) 

Model I: Linear Combination of Phonetic Features 

Hereafter, if an S identified a stimulus as /ba/ it will be denoted B and 
likewise, /da/ as D, /pa/ as P, and /ta/ as T. In the /ba/ to /da/ series only 
the acoustic cues underlying the phonetic feature of place of articulation were 
varied. Since processing in stage 2 is assumed to be independent (i.e., separate 
for different phonetic features), the only variation in the output of stage 2 on 
the /ba/ to /da/ (place) series should be in feature f_1, the phonetic feature of 
place of articulation. Accordingly, since the only variation in the input to 
stage 3 is in f_1, the output of stage 3 (a phonetic segment) is assumed to vary 
directly with the input (f_1) and thus accurately reflect f_1. However, due to 
noise in the acoustic waveform and in the first two stages of processing, the 
outputs of stage 2 are assumed to be probabilistic in nature. Thus, Ss' judg- 
ments of the stimuli from the /ba/ to /da/ series (the probability of responding 
D to a stimulus, Pr[D]) may then be construed as accurately reflecting the input 
(f_1) to stage 3. Similarly, Pr[P] from the /ba/ to /pa/ (voicing) series may be 
construed as accurately reflecting the input of the voicing feature (f_2) to 
stage 3. Now, we can represent f_1 and f_2 from equations 1 and 2 as follows: 



f_1 = Pr[D] on the /ba/-/da/ series (PLACE)      (3)

f_2 = Pr[P] on the /ba/-/pa/ series (VOICING)    (4)

Substituting equations 3 and 4 into equation 1 we obtain equation 5:

Pr[T] = a_1 Pr[D] + a_2 Pr[P]                    (5)

Pr[T] in equation 5 represents the probability of a T response on the
bidimensional /ba/ to /ta/ series.

One additional assumption will be made. This is shown in equation 6: 

a_2 = 1 - a_1 where 0 ≤ a_1 ≤ 1 (6)

This constraint is placed on a_1 and a_2 so that Pr[T] will equal one when both
Pr[D] and Pr[P] are equal to one. Since only one parameter is being used, we
delete the subscript from the constant a_1.

If we now combine equations 5 and 6 and delete the subscript on the constant
a_1 we obtain equation 7:

Pr[T] = a Pr[D] + (1-a) Pr[P] (7) 



Equation 7 represents Model I. This model assumes independence of the features 
of voicing and place. If we estimate parameter a from the data by the method of
least squares, then Model I can be used to predict the bidimensional /ba/ to /ta/ 
identification function based on the unidimensional /ba/ to /da/ and /ba/ to /pa/ 
data. 
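
As a minimal sketch of how equation 7 generates a prediction, the Python
fragment below computes Pr[T] from the two unidimensional functions and
estimates a by least squares. The function names and the example probabilities
are hypothetical and are not the experimental data. The closed-form estimate is
one standard way to minimize squared error for a single linear parameter; it is
an assumption made for illustration and is not necessarily the fitting procedure
used in the study, which is described under Results.

```python
def model_one(a, pr_d, pr_p):
    """Equation 7: Pr[T] = a*Pr[D] + (1 - a)*Pr[P] for each stimulus."""
    return [a * d + (1.0 - a) * p for d, p in zip(pr_d, pr_p)]


def fit_a(pr_d, pr_p, pr_t):
    """Least-squares estimate of a, clipped to the range 0..1 (equation 6).

    Rewriting equation 7 as Pr[T] - Pr[P] = a*(Pr[D] - Pr[P]) gives a
    one-parameter regression through the origin.
    """
    num = sum((t - p) * (d - p) for d, p, t in zip(pr_d, pr_p, pr_t))
    den = sum((d - p) ** 2 for d, p in zip(pr_d, pr_p))
    if den == 0.0:   # the two component functions are identical
        return 0.5   # every a gives the same prediction; 0.5 is arbitrary
    return min(1.0, max(0.0, num / den))


# Hypothetical identification probabilities for five stimuli.
pr_d = [0.0, 0.1, 0.5, 0.9, 1.0]    # Pr[D] on the place series
pr_p = [0.0, 0.0, 0.2, 0.8, 1.0]    # Pr[P] on the voicing series
pr_t = model_one(0.3, pr_d, pr_p)   # synthetic "observed" Pr[T], a = 0.3
a_hat = fit_a(pr_d, pr_p, pr_t)     # recovers a value close to 0.3
```

Because the two weights sum to one, the prediction is a weighted average of
Pr[D] and Pr[P] and therefore always lies between them.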



Model II: Nonlinear Combination of Phonetic Features

A development similar to that given for Model I may be applied to equation 2. 
If we combine equations 2, 3, 4, and 6, we obtain equation 8: 

Pr[T] = a' Pr[D] + (1-a') Pr[P] - b(1-Pr[D]) Pr[P] (8)

Here, a' is used to distinguish this parameter from the parameter of Model I.
A major disadvantage of equation 8 is that it requires two different parameters,
a' and b, to be estimated from the data.

Equation 8 assumes that Ss employ information about both phonetic features,
place and voicing, to make their decision on the /ba/ to /ta/ series. However,
either of these features alone may be sufficient for an S to distinguish between
/ba/ and /ta/ when only two response categories are permitted. For example, an S
could identify /ba/ and /ta/ on voicing alone (i.e., if voiced, respond /ba/; if
voiceless, respond /ta/) or on place alone (i.e., if bilabial, respond /ba/; if
alveolar, respond /ta/). Since these stimuli differ in both voicing and place,
Ss may use only one of these features in their decision. However, it is also
possible that a particular decision on one feature necessarily entails a
particular decision on the other. This is even quite likely considering the
constraints on production. In production, a shift in place of articulation
entails a shift in VOT, but not vice versa. On the other hand, in perception,
the shift in VOT may serve as a cue to a shift in place of articulation.

Previous investigators have found that decisions based on the voicing feature
are more consistent and, in some sense, easier than decisions based on other
features, including place (Miller and Nicely, 1955; Studdert-Kennedy and
Shankweiler, 1970; Shepard, 1972). One reason for this finding may be the
multiplicity of cues to the voicing feature (Liberman et al., 1958; Lisker and
Abramson, 1964; Summerfield and Haggard, 1972), as compared with the relatively
restricted number of cues to the place feature. If subjects were to use only one
feature, it seems likely that they would use the feature of voicing for the /ba/
to /ta/ series. We can operationalize this assumption by setting a' from
equation 8 to zero. This means that 1-a' will be 1 and that the probability of a
/ta/ response (Pr[T]) will be the result of the amount of the voicing feature
present minus the interaction of place and voicing. This is summarized in
equation 9, which represents Model II:

Pr[T] = Pr[P] - b(1-Pr[D]) Pr[P] (9)

The term representing the dependence of voicing on place, (1-Pr[D]) Pr[P], has
been operationalized this way to ensure that Pr[T] does not become negative or
greater than one.

Model II can be used to predict the bidimensional /ba/ to /ta/ series from
the /ba/ to /da/ and /ba/ to /pa/ data by estimating the parameter b with the
method of least squares. By setting parameter a' to zero, Model II assumes that
Ss categorize the stimulus as either /ba/ or /ta/ on the basis of the voicing
feature alone. Thus, parameter b may be used as an estimate of how much an S's
decision on the voicing feature depends upon the place information in the
stimulus.
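
Equation 9 can be sketched in the same way. The Python fragment below is an
illustration under stated assumptions: the function names and the probabilities
are hypothetical, and the closed-form least-squares estimate of b is a standard
one-parameter solution rather than a confirmed account of how the fits were
computed. It also checks the property noted above, that the prediction stays
between zero and one whenever b lies between zero and one.

```python
def model_two(b, pr_d, pr_p):
    """Equation 9: Pr[T] = Pr[P] - b*(1 - Pr[D])*Pr[P] for each stimulus."""
    return [p - b * (1.0 - d) * p for d, p in zip(pr_d, pr_p)]


def fit_b(pr_d, pr_p, pr_t):
    """Least-squares estimate of b, clipped to the range 0..1."""
    # Interaction term x = (1 - Pr[D])*Pr[P]; equation 9 rearranges to
    # Pr[P] - Pr[T] = b*x, a regression through the origin.
    x = [(1.0 - d) * p for d, p in zip(pr_d, pr_p)]
    num = sum((p - t) * xi for p, t, xi in zip(pr_p, pr_t, x))
    den = sum(xi * xi for xi in x)
    if den == 0.0:   # interaction term is zero everywhere; b is undetermined
        return 0.0
    return min(1.0, max(0.0, num / den))


# Hypothetical identification probabilities for five stimuli.
pr_d = [0.0, 0.1, 0.5, 0.9, 1.0]     # Pr[D] on the place series
pr_p = [0.0, 0.0, 0.2, 0.8, 1.0]     # Pr[P] on the voicing series
pr_t = model_two(0.66, pr_d, pr_p)   # synthetic "observed" Pr[T], b = 0.66
b_hat = fit_b(pr_d, pr_p, pr_t)      # recovers a value close to 0.66
in_range = all(0.0 <= t <= 1.0 for t in pr_t)
```

With b = 0 the model reduces to Pr[T] = Pr[P], that is, a decision made on
voicing alone with no contribution from place.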

METHOD 

Subjects. Subjects were nine students in introductory psychology, participating
as a part of the course requirement. Each S was a native American speaker
of English, right-handed, and reported no history of a speech or hearing disorder.

Stimuli. The three synthetic speech syllable series were /ba/ to /da/, /ba/
to /pa/, and /ba/ to /ta/. Each series contained 11 stimuli. The /ba/ to /da/
series varied in the initial frequencies of the second and third formant
transitions. The second formant varied from an initial value of 1,859 Hz (/ba/)
to an initial value of 3,530 Hz (/da/) in ten equal steps. The /ba/ to /pa/
series varied in VOT from 0 msec VOT (/ba/) to a +50 msec VOT (/pa/) in 5 msec
steps. Aspiration replaced the harmonics in the second and third formant
transitions for the duration of the F1 cutback. The /ba/ to /ta/ series combined
the two component changes in a one-to-one fashion, resulting in the third
11-stimulus sequence. All stimuli were of 300 msec duration with a 50 msec
transitional period followed by a 250 msec steady-state vowel (/a/). The three
series of synthetic stimuli were prepared on the speech synthesizer at Haskins
Laboratories and recorded on magnetic tape.

Procedure. The experimental tapes were reproduced on a high quality tape
recorder (Ampex AG-500) and were presented binaurally through Telephonics
(TDH-39) matched and calibrated headphones. The gain of the tape recorder
playback was adjusted to give a voltage across the headphones equivalent to
80 dB SPL re 0.0002 dyn/cm^2 for the steady-state calibration vowel /a/.

On each tape Ss heard 10 presentations of each of the 11 stimuli in random
order with 4 sec between stimuli. Ss were run in two groups, 5 Ss in the first
group and 4 Ss in the second. Each group heard each tape three times, resulting
in 30 judgments of each stimulus for each S. In addition, the /ba/ to /ta/ tape
was presented twice more with a different set of instructions. The order of tape
presentation was randomized with one group hearing the /ba/ to /da/ tape first.

For each of the tapes Ss were told that they would hear synthetic speech
syllables and they were to identify them as /ba/ or /da/, /ba/ or /pa/, /ba/ or
/ta/. Ss were told to record their identification judgment of each stimulus by
writing down the initial stop consonant in prepared response booklets.

RESULTS AND DISCUSSION 

The identification probabilities for the /ba/ to /da/ (place) and /ba/ to
/pa/ (voicing) series were in accord with previous experiments. All the stimuli
at one end of the series were consistently categorized one way and all the
stimuli at the other end were consistently categorized the other way. There were
a few transition stimuli (generally one or two in the middle of the series) which
were categorized both ways at a near chance (.5) level. Data from two Ss were
eliminated from subsequent analyses since they responded to the /ba/ to /da/
series at a chance level throughout. (One S came from each of the groups.)
Identification functions from these two series for a typical S (S number 1) are
shown in Figures 3A and 3B. The /ba/ to /ta/ identification function for the
same S is shown in Figure 3C.

In order to estimate the weighting factor (a) from Model I for each S, a was
allowed to vary from 0.0 to 1.0 in increments of .02. The squared error between
the predicted and observed identification functions was then calculated for each
value of a. The value which resulted in the minimum squared error for each S was
chosen as the best estimate of a. These values of parameter a and their
associated squared errors are shown in Table 1. In six out of seven Ss the
proportion of the variance accounted for by the predicted values exceeded 86
percent. The mean proportion of variance accounted for over all Ss by Model I
was 89.7 percent.
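
The stepwise search just described can be sketched as follows. This is a
hedged illustration, not the original analysis program: the function names and
the input probabilities are hypothetical, and the percent-of-variance measure is
assumed here to be 100 x (1 - squared error / total sum of squares of the
observed function about its mean), since the text does not give the formula.

```python
def grid_search_a(pr_d, pr_p, pr_t_observed, step=0.02):
    """Return (best_a, minimum_squared_error) over a = 0.00, 0.02, ..., 1.00."""
    n_steps = round(1.0 / step)
    best_a, best_sse = None, None
    for k in range(n_steps + 1):
        a = k * step
        # Model I prediction (equation 7) for this candidate value of a.
        predicted = [a * d + (1.0 - a) * p for d, p in zip(pr_d, pr_p)]
        sse = sum((o - q) ** 2 for o, q in zip(pr_t_observed, predicted))
        if best_sse is None or sse < best_sse:
            best_a, best_sse = a, sse
    return best_a, best_sse


def percent_variance_accounted_for(observed, sse):
    """Assumed measure: 100 * (1 - SSE / total sum of squares about the mean)."""
    mean = sum(observed) / len(observed)
    total = sum((o - mean) ** 2 for o in observed)
    if total == 0.0:
        raise ValueError("observed function is constant; measure is undefined")
    return 100.0 * (1.0 - sse / total)


# Hypothetical data: an "observed" function generated with a = 0.66.
pr_d = [0.0, 0.1, 0.5, 0.9, 1.0]
pr_p = [0.0, 0.0, 0.2, 0.8, 1.0]
observed = [0.66 * d + 0.34 * p for d, p in zip(pr_d, pr_p)]
best_a, sse = grid_search_a(pr_d, pr_p, observed)
pct = percent_variance_accounted_for(observed, sse)
```

The same loop serves for parameter b of Model II if the prediction line is
replaced by equation 9.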





TABLE 1: Variance accounted for by Model I.

              Constants         Minimum          Percent of
Subject      a       1-a     Squared Error   Variance Accounted For

   1       0.66     0.34         .1898               94.3
   2       1.00     0.00         .3848               86.2
   3       0.30     0.70         .0633               97.8
   4       0.00     1.00         .1965               92.6
   5       1.00     0.00        1.1814               62.9
   6       0.00     1.00         .0301               99.6
   7       0.64     0.36         .1148               94.7



The data were analyzed a second time for Model II. Parameter a' had been
set to zero, in accord with the assumption that Ss would use only the voicing
feature in making their judgment. Parameter b, the weight of the interaction
term in Model II, was allowed to vary from 0.0 to 1.0 in increments of .02. The
squared error between predicted and observed identification functions was also
computed. The values for each S that resulted in minimum squared error are shown
in Table 2. The proportion of the variance accounted for by the predicted
function was computed and is shown in Table 2. The proportion of the variance
accounted for by Model II is greater than or equal to that accounted for by
Model I for every S. The overall mean proportion of variance accounted for was
92.9 percent in Model II.



TABLE 2: Variance accounted for by Model II.

                  Constants               Minimum          Percent of
Subject      a'      1-a'      b       Squared Error   Variance Accounted For

   1       0.00     1.00     0.66          .1831               94.5
   2       0.00     1.00     1.00          .1790               93.1
   3       0.00     1.00     0.30          .0570               97.8
   4       0.00     1.00     1.00          .1064               95.6
   5       0.00     1.00     1.00          .6570               74.8
   6       0.00     1.00     0.00          .0301               99.6
   7       0.00     1.00     0.64          .1154               94.7



Both Model I and Model II predict the identification probabilities along the
bidimensional speech series reasonably well. However, predictions from Model II,
the interaction model, fit the observed probabilities somewhat better than
predictions from the additive model. There was an increase in the proportion of
variance accounted for in four out of the seven Ss with Model II. For three of
the Ss the variance accounted for by Model II remained the same as in Model I,
although the parameter values changed. In fact, the three Ss with the highest
proportion of variance accounted for in Model I are the three Ss for whom
Model II shows no gain.

We suggested earlier that identification of the bidimensional series /ba/ to
/ta/ might be based on the use of only one feature, voicing, since Ss were
constrained to only two response categories. Parameter a' in Model II was set at
zero on the assumption that the place feature is based entirely on the voicing
feature and would not contribute directly to the response decision. The strength
of the interaction model, Model II, can be tested by letting parameter a' vary as
in equation 8. Accordingly, when the squared error between the predicted and
observed probabilities was obtained by equation 8, a' was estimated to be zero
for every S. The estimates of parameter b were identical to those obtained with
equation 9 where a' was previously set to zero. This suggests that our original
assumption was correct. Ss apparently relied more on the voicing feature than
the place feature in the two category bidimensional series.

The extent to which place information enters into the voicing decision for
each S is reflected in parameter b from Model II (equation 9). This parameter is
greater than zero for all Ss except one, indicating that place information does
affect the voicing decision, although only in terms of an interaction. Although
the fit of the additive model (Model I) is good, the better fit of the interactive
model (Model II) and the generally nonzero estimates of the interaction term
support the notion that the phonetic features of place and voicing in stop
consonants are not combined independently in stage 3.

Probabilities for the second identification function generated by Ss for the
/ba/ to /ta/ series with four response alternatives were also computed. Although
this condition was included in the experiment almost as an afterthought, the
results were not only surprising but consistent among Ss. The identification
function for the same representative S as before in this condition is shown in
Figure 3D. The high probability of a P response for stimulus 7 and the
distribution of P responses around this mode are of special interest. If an S
were responding P at random, the Pr[P] in this series should be .25 for all
stimuli instead of approximately zero everywhere except for a few stimuli. This
same pattern of P responses was found for all Ss tested. The peak probability of
a P and the stimulus at which it occurred are shown in Table 3. When the data for
each S are broken down by tape presentation, the same results are observed (see
Table 3).













TABLE 3: Peak Pr[P] in the second /ba/-/ta/ series.

            Peak      Split-Half Pr[P]      Stimulus Where
Subject     Pr[P]        1        2          Peak Occurs

   1         .90       1.00      .80              7
   2        1.00       1.00     1.00              7
   3         .85        .90      .80             6,7
   4         .80        .70      .90              7
   5         .80        .90      .70              7
   6         .85       1.00      .70              8
   7         .65        .90      .40              7



In contrast, the Pr[D] was much lower for all Ss except one, and showed
greater variability when subject to split-half analysis. These data are shown in
Table 4. One S reported only a single /da/ in 220 test trials.

It would appear that the occurrence of /da/ identifications in the /ba/ to
/ta/ series with four response categories may be randomly distributed. On the
other hand, the occurrence of /pa/ identifications is highly consistent both
within and across Ss and the peak probability is never less than .65.

If the phonetic features of place and voicing combined separately and
additively in stage 3 as Model I would predict, the identification functions for
this second series should resemble the data for the first /ba/ to /ta/ series.
This did not occur, as shown in Figure 3D. Ss showed consistent use of the /pa/
response in the second /ba/ to /ta/ series at levels well above chance
expectation. The peak Pr[P] in this second bidimensional series occurred at a
stimulus whose place value generally corresponded to a high Pr[D] in the /ba/ to
/da/ (place) series (see Table 5). Similarly, the peak Pr[P] stimulus in the
bidimensional series corresponds to a stimulus in the /ba/ to /pa/ (voicing)
series which exhibits a high Pr[P] (see Table 5). A model that assumes separate,
additive weighting of the features, such as Model I, would predict that the
stimulus where the peak Pr[P] occurs in the second bidimensional condition would
be categorized as /ta/ and not /pa/.

TABLE 4: Peak Pr[D] in the second /ba/-/ta/ series.

            Peak      Split-Half Pr[D]      Stimulus Where
Subject     Pr[D]        1        2          Peak Occurs

   1         .10       0.00      .20              7
   2         .80        .80      .80              5
   3         .45        .60      .30              4
   4         .35        .60      .10              5
   5         .50       1.00     0.00              5
   6         .05        .10     0.00             10
   7         .30        .20      .40              5

TABLE 5: Peak Pr[P] in the second /ba/-/ta/ series; Pr[D] in the /ba/-/da/
         series; and Pr[P] in the /ba/-/pa/ series to the corresponding
         stimulus.

            Peak        Corresponding
Subject     Pr[P]      Pr[D]     Pr[P]      Stimulus

   1         .90       .267      .967          7
   2        1.00       .800      .967          7
   3         .85       .500      .967         6,7
   4         .80       .900      .933          7
   5         .80       .933      .933          7
   6         .85       .733      .833          8
   7         .65       .867     1.000          7

A model to fit these four-response data was constructed based on equation 8.
However, since Model II did not fit the four-response data very well, even when
parameter a' was allowed to vary, another term was added in which the place
feature is dependent on the voicing feature. This model now reflects an
interdependence of these two features on each other. This model, summarized in
equation 10 below, has two parameters to be estimated:

Pr[T] = a'Pr[D] - b'(1-Pr[P]) Pr[D] + (1-a') Pr[P] + (1-b') (1-Pr[D]) Pr[P] (10)

Figure 4: Observed identification function for a representative S in the
four-response /ba/ to /ta/ series (part A) and the predicted function for
the same S using equation 10 (part B).

Equation 10 generally failed to predict the magnitude of the Pr[P] in the
four-response condition, although equation 10 did generally predict a peak at
stimulus seven. The fit of equation 10 to one S's data is shown in Figure 4.
The obtained identification function is shown in panel A; the predicted function
derived by equation 10 is shown in panel B. The response data for all conditions
for this same S were shown previously in Figure 3.

The failure of equation 10 to predict accurately the entire set of
probabilities for the second /ba/ to /ta/ series may be attributed to two
possible factors. First, Ss' identification functions for the two component
series (/ba/ to /da/ and /ba/ to /pa/) were somewhat noisy. Second, processing
in stages 1 and 2 of Figure 2 may not be independent as we have assumed. Any
nonindependence of processing, especially in stage 2 where the phonetic features
are extracted, would affect the assumptions made in deriving Model I and Model II.

In summary, an additive model which assumes independence in the processing 
of phonetic features cannot account for the identification functions when the 
acoustic cues underlying place and voicing in stop consonants are varied system- 
atically. Rather, it appears that an interaction model handles the data much 
better and provides additional support for the evidence previously reported by 
Lisker and Abramson (1964, 1967) and Haggard (1970) with different experimental 
procedures. The perception of an acoustic cue underlying a particular phonetic 
feature (e.g., place or voicing) may not be invariant with changes in the acous- 
tic cues underlying other phonetic features. This conclusion is scarcely sur- 
prising since covariations in the acoustic cues derive directly from production 
constraints, and is added evidence of the close link between speech perception 
and production. 

CONCLUSION 

An additive model which assumes independence of processing at all stages did
a creditable job in predicting the response probabilities along a bidimensional
series of synthetic stop consonants when Ss were constrained to two responses.
However, a model that does not assume additive processing (i.e., nonindependence)
in the stage where phonetic features are combined does even better than the
independence (i.e., additive) model. When Ss are given four responses from which
to choose, the additive model fails completely. In contrast, the nonadditive
model, while not yielding an excellent fit, does predict occurrence of /pa/
identifications on the /ba/ to /ta/ series. Based on these perceptual data with
synthetic speech stimuli, we conclude that phonetic features in stop consonants
are not combined independently to form phonetic segments.

The Lag Effect in Dichotic Speech Perception

Emily F. Kirstein



INTRODUCTION 

An important factor in dichotic competition is the temporal alignment of
syllable onsets at the two ears. It is well known that if different syllables
are presented simultaneously to opposite ears, syllables at the right ear are
more accurately identified than those at the left (Shankweiler and
Studdert-Kennedy, 1967; Studdert-Kennedy and Shankweiler, 1970). Recently, it was
discovered that the size of the right-ear effect can be increased by delaying
syllable onsets at the right ear 5 to 120 msec behind the left and, conversely,
that the ear advantage can be reduced or even reversed (giving a left-ear
superiority) by causing the left-ear syllable to be delayed behind the right
(Lowe, Cullen, Thompson, Berlin, Kirkpatrick, and Ryan, 1970; Studdert-Kennedy,
Shankweiler, and Schulman, 1970). That is, in general, lagging syllables at
either ear have an advantage over leading syllables; this "lag effect" is seen
superimposed on the right-ear effect in dichotic experiments.

In control experiments syllables differing in time of arrival by 5 to 120
msec were mixed electronically and the resulting signal was delivered to one ear
(monotic) or to both ears (diotic). For these conditions leading syllables were
more accurately identified than lagging syllables (Lowe et al., 1970;
Studdert-Kennedy et al., 1970; Kirstein, 1971; Porter, 1971a). Studdert-Kennedy
et al. (1970) attributed the diotic and monotic lead effect to peripheral
masking. They considered the dichotic lag effect to be a higher-level phenomenon
involving competition for perceptual processing. They proposed that the lagging
syllable is more intelligible than the leading syllable because it interrupts
the phonetic analysis of the leading syllable and captures the speech processor.

All the experiments referred to above used as stimuli the stop
consonant-vowel (CV) syllables /ba/, /da/, /ga/, /pa/, /ta/, /ka/, where
consonants differed but vowels were shared. Subsequently, other stimuli have been
used to determine precisely the conditions required to elicit the lag effect.
Apparently, as long as there are stop consonants in CV syllables contrasting
between ears, the lag effect will persist, despite substantial variations in the
acoustics of interaural competition. For example, the lag effect has been
demonstrated where vowels as well as consonants vary between ears, as in
/ba/-/ge/ (Kirstein, 1971), where the fundamental frequency of the syllables
varies between ears (Halwes, 1969), and where the duration of the competing
syllables has been shortened from the usual 300 msec to only 75 msec (Porter,
1971a). The minor influence of such acoustic variations strengthens the view
that a critical condition for producing the lag effect is the "perceptual class"
of the stimuli.

It is not certain whether the lag effect is peculiar to speech or whether it 
is a more general phenomenon of auditory perception. Darwin (1971) asserted that 
the lag effect was related to the perception of rapidly changing acoustic signals 
(transitions) whether in speech or nonspeech tasks. He supported this claim by 
demonstrating a lag effect for perception of pitch changes (rising, falling, 
level) in the initial 50 msec of a 150 msec steady-state vowel. However, Porter 
(1971b) found no evidence of a lag advantage in perception of formant transitions 
isolated from the speech signal itself. Also, in Darwin's study, the use of a 
vowel as a carrier of the pitch transition makes it difficult to classify the 
sounds unambiguously as nonspeech. Thus, at present there is no strong evidence 
to refute the hypothesis of Studdert-Kennedy et al. (1970) that the lag effect is 
a speech perception phenomenon. 

Not all classes of speech sounds are equally effective for eliciting the lag
effect. Porter (1971b) compared stop consonants (/b/, /d/, /g/) with sonorant
consonants (/l/, /w/, /y/). He found that some subjects had a lag effect for
both stops and sonorants while other subjects had the lag effect for stops only.
Porter, Shankweiler, and Liberman (1969) presented steady-state vowels
dichotically with delays between ears, and they found a slight advantage for
leading over lagging vowels. However, Kirstein (1971) found a preference for
lagging vowels if the vowels were embedded in CV syllables. She also found a lag
effect for isolated steady-state vowels among subjects who had previously taken
a dichotic test involving stop consonants. The finding that the lag effect is an
extremely robust effect for stops, less robust for sonorants, and marginal for
vowels supports the view that the effect is related to special decoding
processes in speech perception. The term "encoding" has been used by Liberman,
Cooper, Shankweiler, and Studdert-Kennedy (1967) to refer to the fact that the
acoustic cues for perception of a particular phoneme may be greatly affected by
the nature of the adjacent phonemes. Liberman et al. (1969) proposed that highly
encoded phonemes like stop consonants require special decoding to arrive at
phonetic identification, while unencoded phonemes like the isolated steady-state
vowels could optionally be identified through purely auditory perception modes.
(Sonorant consonants are more highly encoded than vowels, but less encoded than
stops.) The finding that the lag effect occurs for vowels under some
circumstances but not under others suggests that the lag effect is related to
the perceptual mode (speech or nonspeech) rather than to some acoustic feature
of the stimulus.

The present research is concerned with the methodology of lag effect
experiments, and specifically with the role of attention. In dichotic experiments
listeners are generally required to attend simultaneously to both ears. The
question examined here is whether leading syllables might be as accurately
identified as lagging syllables are if attention were concentrated on leads or
lags only, rather than divided between them. It had been claimed by Inglis (1965)
and by Treisman and Geffen (1968) that the ear advantage in dichotic tasks can be
attributed to systematic biases on the part of subjects in their order of
reporting the two ears or in the distribution of attention between ears. To
control for these factors in the study of ear effects Kirstein and Shankweiler
(1969) introduced the procedure of having listeners concentrate on one ear at a
time, reporting only the syllables at the "attended" ear. They found that for
dichotically presented stop consonants report was more accurate under right-ear
attention than under left-ear attention. They concluded that neither response
bias nor the distribution of attention could explain the ear asymmetry. In
dichotic experiments where syllables are temporally offset between ears, the
distribution of attention and response ordering might also affect the pattern of
identification errors. It seemed desirable, therefore, to determine whether the
lag advantage would occur with attention directed selectively toward lagging or
leading syllables.

METHOD 

A dichotic tape consisting of pairs of CV syllables was constructed.
Syllables within a pair always contrasted in the initial consonant (/b/, /d/, or
/g/) but shared the same vowel. Within a pair, one syllable was always delayed
relative to the other by 10, 30, 50, 70, or 90 msec. This tape was presented to
subjects under three different task conditions. In the two-response task the
listeners were instructed to report both consonants on each trial and to
indicate which was the clearer. This is essentially the method of
Studdert-Kennedy et al. (1970). In the ear monitoring task subjects were
instructed to concentrate their attention on a particular ear and to report only
the consonants at the "attended" ear. In the temporal order task listeners were
instructed to attend to the order of arrival of the consonants within a pair and
report only the leading stops or only the lagging stops according to
instructions.

Stimuli. Nine consonant-vowel syllables were synthesized on the Haskins
Laboratories' parallel resonance speech synthesizer. These were /ba/, /da/, /ga/,
/be/, /de/, /ge/, /bo/, /do/, /go/. The duration of each syllable was 350 msec.
Syllables beginning with /d/ or /g/ started with a 10 msec noise burst, followed
by appropriate formant transitions. No burst was needed with /b/. In all cases
the formant transitions were completed and the steady-state vowel parameters
reached within 70 msec.

The intelligibility of the syllables was assessed by asking four listeners
unfamiliar with synthetic speech to identify the consonants in the 180-trial
randomization. Ninety-five percent correct identifications were obtained, an
adequate level of intelligibility for the dichotic tests.

Waveforms of the stimuli were stored on a computer disc file using the Pulse 
Code Modulation System (Cooper and Mattingly, 1969). This computer system also
controlled the alignment of syllable onsets as the syllables were recorded in 
pairs in a specified order onto the two-channel dichotic test tape. 

In the design of the dichotic tape, care was taken to counterbalance to
prevent confounding of ear effects, lag effects, and tape channel imbalance. Each
ear received the same number of lag and lead trials, and the same permutations
of the syllables. From the nine CV syllables there are 18 possible permutations
of two syllables where only the consonants differ. A 180-trial randomization was
assembled in which each of the 18 permutations occurred twice at each of the five
delay intervals (10, 30, 50, 70, and 90 msec), once with channel-1 delay and once
with channel-2 delay. All conditions of offset were randomly ordered on the
tape. There was a 6-sec pause between pairs and a 10-sec pause after ten pairs.
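
The counterbalancing arithmetic above (18 permutations x 5 delays x 2 delayed
channels = 180 trials) can be checked with a short Python sketch. The tuple
layout, the name build_trials, and the random seed are illustrative assumptions;
the vowel letters simply follow the syllable spellings listed under Stimuli, and
the actual order of trials on the original tape is not known.

```python
import itertools
import random

CONSONANTS = ["b", "d", "g"]
VOWELS = ["a", "e", "o"]
DELAYS_MSEC = [10, 30, 50, 70, 90]


def build_trials(seed=0):
    """Return 180 trials as (channel1_syllable, channel2_syllable,
    delay_msec, delayed_channel) tuples in a random order."""
    # 18 ordered pairs: same vowel, different consonants
    # (3 vowels x 6 ordered consonant pairs).
    pairs = [
        (c1 + v, c2 + v)
        for v in VOWELS
        for c1, c2 in itertools.permutations(CONSONANTS, 2)
    ]
    # Each pair occurs at every delay, once with each channel delayed.
    trials = [
        (ch1, ch2, delay, delayed_channel)
        for ch1, ch2 in pairs
        for delay in DELAYS_MSEC
        for delayed_channel in (1, 2)
    ]
    random.Random(seed).shuffle(trials)  # hypothetical seed, for repeatability
    return trials


trials = build_trials()
n_channel1_delayed = sum(1 for t in trials if t[3] == 1)
```

In this sketch each channel carries the delayed syllable on exactly half of the
trials, and every delay interval occurs 36 times.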

Procedure. The dichotic tape was played from a General Radio stereo tape
deck into a special amplifier built by Zeichner of Haskins Laboratories for
group dichotic experiments. As many as six subjects could be tested at one time.
The subjects listened to the tape over Grason-Stadler stereo headphones. The
tape was presented at a comfortable listening level, and the output intensity
from the two channels was equated to within 1 dB with the aid of calibration
signals on the tape.

As an added control for channel effects, the dichotic tape was always pre- 
sented twice within a test session with the headphone orientation physically re- 
versed on the second presentation. 

Two-response task. The subjects were told that they would receive two
different syllables on each trial, one syllable to each ear, and that the
syllables would differ between ears in the consonant (/b/, /d/, or /g/), never in
the vowel. The instructions were to report both of the consonants, giving two
different responses and guessing if necessary. An added aspect of this task was
the clarity judgment. The responses were to be ordered on the answer sheet so
that the clearer consonant was written in the first column for each pair and the
less clear consonant in the second column. If one of the two responses was a
guess, it was to be written in the second column.

Each subject had 360 trials in a single 1-hour test session. 

Ear monitoring task. The subjects were told that they would receive two
different syllables on each trial, one syllable at each ear, and that the
syllables would differ between ears in the consonant (/b/, /d/, or /g/), never in
the vowel. The instructions were to attend to a particular ear designated by the
experimenter and to write down on each trial the consonant arriving at the
"attended" ear. The subjects were required to respond on every trial even if
the response was a guess. Each subject had 360 trials under left-ear attention
and 360 trials under right-ear attention.

Each subject had two 1-hour (360-trial) test sessions, with each session 
subdivided into four 90-trial blocks. At the start of a block the subject was 
told whether to report the right ear or the left, and this instruction was in 
effect for the entire block. Within a session the order of the blocks was Right- 
Left-Left-Right or Left-Right-Right-Left; a subject was randomly assigned to one 
of these orders for his first session and was automatically assigned to the other 
order for the second session. 
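
The counterbalancing of block orders across the two sessions can be sketched in a few lines. This is only an illustration: the function name and the use of Python's `random` module are assumptions, since the text says only that the first-session order was assigned randomly.

```python
import random

# The two block orders named in the text (four 90-trial blocks per session).
BLOCK_ORDERS = (
    ("Right", "Left", "Left", "Right"),
    ("Left", "Right", "Right", "Left"),
)
TRIALS_PER_BLOCK = 90


def assign_block_orders(rng):
    """Pick one order at random for session 1; session 2 gets the other."""
    first = rng.randrange(2)
    return BLOCK_ORDERS[first], BLOCK_ORDERS[1 - first]


if __name__ == "__main__":
    session_1, session_2 = assign_block_orders(random.Random())
    print("Session 1:", "-".join(session_1))
    print("Session 2:", "-".join(session_2))
    # Each session has 4 blocks of 90 trials, i.e. 360 trials.
    print("Trials per session:", len(session_1) * TRIALS_PER_BLOCK)
```

Whichever order is drawn first, each attended ear is covered by two blocks per session, so every subject ends with 360 trials under each attention condition.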

Temporal order task . The subjects were told that they would receive two 
different syllables on each trial, one at each ear; that the syllables would dif- 
fer between ears in the consonant (/b/, /d/, or /g/); and that the syllables 
would also differ slightly in onset time, with the leading syllable randomly at
the right or left ear. The instructions were to attend to the order of arrival
of the two syllables and to report either the lagging or leading syllable, as 
specified by the experimenter. The subjects were required to respond on every 
trial even if the response was a guess. Each subject had 360 trials reporting 
the leading syllables and 360 trials reporting the lagging syllables. 

Each subject had two 1-hour (360-trial) test sessions, with each session 
subdivided into four 90-trial blocks. At the start of a block the subject was 
told whether to report the leading syllables or the lagging ones, and this in- 
struction was in effect for the entire block. Within a session the blocks were 
arranged Lags-Leads-Leads-Lags or Leads-Lags-Lags-Leads; a subject was randomly 
assigned to one of these orders for his first session and was automatically 
assigned to the other order for the second session. 

Subjects . The ear monitoring and temporal order tasks were originally 
studied together as part of a single experiment. A group of 12 subjects took 
both tasks, half taking the temporal order first, and half the ear monitoring 
task. Later, an additional group of 10 subjects was run on the ear monitoring 
task only, making a total of 22 ear monitoring subjects. (The data analysis re- 
vealed no systematic differences between the original group of 12 and the added 
group on the ear monitoring task. For purposes of analysis sometimes only the 
original 12 subjects are considered while in other cases data are presented for 
the entire group of 22.) 

The two-response task was performed by 24 subjects, none of whom had previously
participated in dichotic experiments.

The subjects, all Introductory Psychology students at the University of 
Connecticut, received course credit for their participation. They were native 
speakers of English, were self-classified "right-handers," and had (to their 
knowledge) normal hearing in both ears. 

RESULTS 

Two-response task. The percent correct responses on the two-response task 
is shown in Figure 1, presented according to the various stimulus conditions: 
lag vs. lead, right vs. left ear, and length of interaural delay. The data rep- 
licate the earlier findings: lagging syllables were more accurately identified 
than leading syllables, and the right ear was more accurate than the left ear. 
The stops that were lagging and at the right ear were most often correctly iden- 
tified, next in accuracy were left-ear lags, then right-ear leads, and finally 
left-ear leads. The only exception to this ordering was at 10 msec, where
right-ear leads were slightly better than left-ear lags. An analysis of variance
on the two-response data is summarized in Table 1. Both the right-ear effect and
lag effect were highly significant. Also, averaging over lags and leads and over 
the left and right ears, there was a statistically significant rise in the number 
of correct identifications for longer interaural delay intervals. Figure 1 shows 
that both the lag advantage and ear advantage vary in magnitude depending on the 
length of interaural delay; however, in the analysis of variance only the
interaction of lag effect with delay was significant.

The clarity-judgment instructions were apparently being followed, since
errors occurred primarily on second responses: 5 percent of responses in the
first column were errors, but 28 percent of second responses were errors. Errors
on second responses decreased with longer interaural delay intervals, but the
first-response error rate was independent of delay interval. An analysis of
first responses showed that lagging stops were judged to be clearer than leading
and the right ear was judged clearer than the left. (The first-response results
are shown in Figure 7, which is discussed later.) Of the 24 subjects, 23 favored
lags over leads in first responses, and 18 favored the right ear over the left.
(Only the direction of preference was tabulated, not the magnitude of the effect.)

Figure 1: Percent correct responses on two-response task (N=24). Each point
is based on 864 trials, 36 trials per subject.

Ear monitoring and temporal order tasks . The results of the ear monitoring 
task are shown in Figure 2 and the temporal order results in Figure 3. Both 
tasks gave essentially the same results. On the ear monitoring task, listeners 
were more accurate on right-ear attention than on left-ear attention, and they 
were more accurate when the attended ear received a lagging syllable than when it 
received a leading syllable. On the temporal order task, listeners made fewer
errors under "report lags" instructions than under "report leads" instructions, 
and report of either lags or leads was more accurate from the right ear than from 
the left. The analysis of variance for the selective listening tasks is summa- 
rized in Table 1. For both tasks the lag effect and right-ear effect were sig- 
nificant, and both the ear effect and the lag effect showed significant varia- 
tions in magnitude with the length of interaural delay. Accuracy of report by 
ear or by temporal order improved significantly with longer delay intervals. 

Since 12 subjects took both the temporal order and ear monitoring tasks, it 
was possible to make a comparison of the consistency of individual ear effects 
and lag effects across the two tasks. A measure of each subject's lag effect was 
obtained for the temporal order task by subtracting the number of correct re- 
sponses under "report leads" from the number of correct responses under "report 
lags." For the ear monitoring task, a measure of the lag effect was obtained by
summing the number of correct right-ear and left-ear responses when the attended 
ear received lagging syllables and subtracting the number of correct responses 
when the attended ear received leading syllables. Similarly, a measure of the 
ear effect was determined for each subject on each task. The reliability of in- 
dividual differences across tasks was assessed by calculating the Pearson product- 
moment correlation coefficient. The individual lag effect measures gave a corre- 
lation coefficient of .85 across tasks, and the ear effect measures gave a corre- 
lation coefficient of .95, both of which are highly significant, with p < .001. 
These results indicate that individual lag effect scores and individual ear 
effect scores are reliable, even when the measures are obtained on different 
tasks. 
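
The cross-task reliability check described above can be sketched as follows. The per-subject counts are invented for illustration and are not the study's data; the function names are also assumptions. Only the two lag-effect measures and the Pearson product-moment formula come from the text.

```python
from math import sqrt


def lag_effect_temporal_order(correct_report_lags, correct_report_leads):
    """Correct responses under "report lags" minus those under "report leads"."""
    return correct_report_lags - correct_report_leads


def lag_effect_ear_monitoring(right_lag, left_lag, right_lead, left_lead):
    """Correct responses when the attended ear lagged (both ears summed)
    minus correct responses when the attended ear led."""
    return (right_lag + left_lag) - (right_lead + left_lead)


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical counts of correct responses for four subjects.
    temporal = [lag_effect_temporal_order(a, b)
                for a, b in [(250, 200), (270, 190), (240, 230), (300, 210)]]
    monitoring = [lag_effect_ear_monitoring(*counts)
                  for counts in [(140, 130, 120, 110), (150, 135, 115, 100),
                                 (130, 125, 125, 120), (160, 150, 120, 105)]]
    print("r across tasks:", round(pearson_r(temporal, monitoring), 3))
```

The ear effect measure would be handled the same way, with right-ear and left-ear totals in place of lag and lead totals.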

The types of errors made on the ear monitoring and temporal order tasks fell
into two categories. The response could be identical to the unattended syllable, 
in which case it is termed an "intrusion" error; or the response could differ 
from both the attended and unattended syllables, in which case it is termed a 
"nonintrusion" error. Figure 4 gives a breakdown of all responses as correct re- 
sponses, intrusion errors, or nonintrusion errors. Intrusions were the primary 
source of errors under both attention conditions. That is, listeners had diffi- 
culty in discriminating between the attended and the unattended syllables. The 
ability to select the attended syllable improved with longer delay, and Figure 4 
shows that this improvement is entirely due to a reduction in the number of
intrusions of unattended syllables. Somewhat surprisingly, nonintrusion errors
increased slightly with longer interaural delay, whereas they might have been
expected to decline. For the ear monitoring task, the increase in nonintrusion
errors with longer interaural delay proved to be statistically significant
[Friedman two-way analysis of variance, χr² = 1^*8, p < .01 (Siegel, 1956)].
Overall, reporting a particular ear was more accurate than report by order of
arrival (F = 17.8, df = 1,11, p < .005).

Figure 2: Percent correct responses on ear monitoring task (N=22). Each point
is based on 792 trials, 36 trials per subject.

Figure 3: Percent correct responses on temporal order task (N=12). Each point
is based on 432 trials, 36 trials per subject.
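
The three-way breakdown of responses on the selective listening tasks can be written out as a small scoring routine. This is a minimal sketch: the function names and the example trials are hypothetical. Only the definitions of "intrusion" and "nonintrusion" errors are taken from the text.

```python
from collections import Counter

CATEGORIES = ("correct", "intrusion", "nonintrusion")


def classify_response(response, attended, unattended):
    """Score one trial of a selective listening task."""
    if response == attended:
        return "correct"
    if response == unattended:
        return "intrusion"      # the unattended syllable was reported
    return "nonintrusion"       # the response matches neither syllable


def breakdown(trials):
    """Percent of trials in each category.

    `trials` is a list of (response, attended, unattended) consonants.
    """
    if not trials:
        raise ValueError("no trials to score")
    counts = Counter(classify_response(*trial) for trial in trials)
    return {cat: 100.0 * counts[cat] / len(trials) for cat in CATEGORIES}


if __name__ == "__main__":
    # Hypothetical trials: (response, attended, unattended)
    example = [("b", "b", "d"), ("d", "b", "d"), ("g", "b", "d"), ("b", "b", "g")]
    print(breakdown(example))
```

Because the response set holds only three consonants and the two syllables always differ, every incorrect response falls into exactly one of the two error categories.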

Lag effect and ear effect as functions of delay interval . Thus far it has 
been demonstrated that there is an advantage for lagging syllables and right-ear 
syllables in all three tasks. The present section examines how the magnitude of 
lag effect and ear effect varied with length of delay. 

The lag effect was treated independently of the right-ear effect by comput- 
ing the mean percent correct on right- and left-ear lags and subtracting from 
this the mean percent correct on right- and left-ear leads. Figure 5 displays 
these lag effect scores at each delay for the three tasks. For the two-response 
task separate plots are shown for the lag effect in clarity judgments (first 
responses) and for the lag effect in intelligibility (both responses). The same 
trend was observed in all tasks: the advantage for lags over leads progressively 
increased with longer interaural delay up to 50 msec; with still longer delays 
the lag effect began to diminish. The lag effect showed a maximum at 50 msec for 
all three tasks, so this location must be considered a reliable finding, at least 
for this particular set of stimuli. 

The finding of a lag effect peak in the 50-70 msec delay range is also in 
agreement with other observations (Berlin, Lowe-Bell, Cullen, Thompson, and 
Loovis, 1973; Studdert-Kennedy et al., 1970; Porter, 1971b). 

A right-ear effect score at each delay was computed similarly by subtracting 
the mean percent correct on left-ear leads and lags from the mean percent correct 
on right-ear leads and lags. These scores are plotted in Figure 6. The right- 
ear advantage was greatest at short interaural delay intervals and declined with 
longer delays. Again all three tasks showed the same trend, although for the 
two-response task the change in ear effect with delay was not statistically sig- 
nificant. From these results it can be inferred that the right-ear advantage 
would have been maximal with simultaneous onsets. 
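
The two difference scores defined in this section can be made concrete with a short sketch. The percentages below are invented for illustration and the function names are assumptions; the formulas follow the verbal definitions of the lag effect and right-ear effect scores.

```python
def lag_effect(pct):
    """Mean percent correct on right- and left-ear lags minus the
    mean percent correct on right- and left-ear leads."""
    lags = (pct[("right", "lag")] + pct[("left", "lag")]) / 2
    leads = (pct[("right", "lead")] + pct[("left", "lead")]) / 2
    return lags - leads


def ear_effect(pct):
    """Mean percent correct on right-ear leads and lags minus the
    mean percent correct on left-ear leads and lags."""
    right = (pct[("right", "lag")] + pct[("right", "lead")]) / 2
    left = (pct[("left", "lag")] + pct[("left", "lead")]) / 2
    return right - left


if __name__ == "__main__":
    # Hypothetical percent-correct values at a single delay interval.
    one_delay = {
        ("right", "lag"): 80.0,
        ("left", "lag"): 70.0,
        ("right", "lead"): 60.0,
        ("left", "lead"): 50.0,
    }
    print("lag effect:", lag_effect(one_delay))  # 75.0 - 55.0 = 20.0
    print("ear effect:", ear_effect(one_delay))  # 70.0 - 60.0 = 10.0
```

Averaging over ears before taking the lag difference, and over lag position before taking the ear difference, is what allows the two effects to be treated independently at each delay.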



While the pattern of results was basically the same in all tasks, the magnitude
of ear advantage and lag advantage did vary considerably among the tasks, as
can be seen in Figures 5 and 6. Statistical factors were considered first in
trying to account for these magnitude differences. It can be shown that the
maximum possible lag effect or ear effect which can be obtained in the
two-response task is only 50 percent, whereas in the selective listening and
clarity judgment tasks a 100 percent lag effect or ear effect could be obtained.
A correction must be applied to the two-response data (multiplying all ear effect
and lag effect scores by 2) before comparing the two-response with the other
tasks. However, even after this correction has been applied, the magnitude of the
effects obtained in clarity judgments is still greater than in any of the other
tasks. A possible explanation for this result is that clarity judgments are more
sensitive than the other tasks to the effects being studied. The clarity judgment
task required listeners to compare the competing syllables qualitatively, while
the two-response and selective listening tasks asked subjects to identify the
syllables.

The 50 percent ceiling on the lag effect in the two-response task derives from
the fact that the subject must give two different responses on each trial from
a set of only three possible responses. Suppose that the listener always heard
correctly the lagging consonants but not the leading consonants. His guess for
the second response would nevertheless be correct for the leading consonant for
half the trials. Thus, the lag effect would have been 100 percent considering
only first responses but is automatically reduced by half when both responses
are considered. The same argument applies to the ear effect. Using a larger
response set would have given a higher ceiling on the magnitude of the effects
obtained in the two-response task.
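
The 50 percent ceiling on the two-response task can be checked by enumeration. The sketch below assumes the idealized listener of the ceiling argument, who always hears the lagging consonant and guesses uniformly among the remaining alternatives for the second response. The function name and the six-consonant comparison set are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def two_response_ceiling(consonants):
    """Maximum lag effect (in percent) when both responses are scored.

    Idealized listener: the lagging consonant is always reported correctly
    as the first response; the second response is a uniform guess among
    the remaining consonants, because the two responses must differ.
    """
    lag_correct = Fraction(0)
    lead_correct = Fraction(0)
    n_trials = 0
    for lag, lead in permutations(consonants, 2):
        n_trials += 1
        lag_correct += 1
        options = [c for c in consonants if c != lag]
        # Probability that the guess happens to match the leading consonant.
        lead_correct += Fraction(options.count(lead), len(options))
    return 100 * (lag_correct - lead_correct) / n_trials


if __name__ == "__main__":
    print(two_response_ceiling(("b", "d", "g")))                 # 50
    # Hypothetical larger response set, to show the ceiling rising.
    print(two_response_ceiling(("b", "d", "g", "p", "t", "k")))  # 80
```

With three alternatives the guess is right on half the trials, so a 100 percent advantage on first responses shrinks to 50 percent when both responses are scored. That is why the two-response scores are doubled before comparison with the other tasks.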

The idea that clarity judgments are more sensitive than identification tasks 
receives support from a more detailed comparison of clarity judgments with re- 
sponses on the ear monitoring task. Figure 7 compares these two tasks directly, 
plotting on the same chart the percent "clearer" judgments for the various tem- 
poral offset conditions against the percent of trials on which these same sylla- 
bles were correctly identified under selective listening. If the curves for the 
two tasks were to coincide, this would indicate that the subjects could correctly 
identify in ear monitoring only those stops which were independently judged to be 
the relatively clearer within the pair. Divergence between the ear monitoring 
and clarity judgment curves gives the percentage of trials on which the "less
clear" consonant could be identified when attention was concentrated on that 
sound. It is evident that very frequently the "less clear" sound could in fact 
be identified. It is interesting also that at short delays the responses given 
were not greatly affected by the specific task instructions, while longer delays 
produced greater divergence between tasks. The greater sensitivity of the clar- 
ity judgment over the identification task can be seen in the fact that as we move 
from the most favored condition (right-ear lags) to the least favored conditions 
(left-ear leads) the change between conditions is greater for clarity judgments 
than for identification. 

DISCUSSION 

All three tasks gave essentially the same patterns of identification errors. 
The effects observed in all tasks were the lag effect, the right-ear effect, the
variation in the magnitude of lag effect and ear effect with delay, the improved 
performance with longer delay, and the susceptibility to intrusion errors in the 
selective listening tasks. The consistency in error pattern across tasks is con- 
vincing evidence that there are genuine variations in intelligibility of dichotic 
syllables depending on ear of arrival, order of arrival, and temporal offset.
These effects are clearly not under the listeners' voluntary control, and they 
cannot be explained in terms of attentional strategies or response order. 

It is often said that dichotic presentation causes errors because of perceptual
competition between stimuli, that is to say, because the perceptual system
is unable to process both ears simultaneously. The term "perceptual competition" 
is usually interpreted to mean that verbal stimuli presented simultaneously to 
opposite ears must compete for entry to language processing areas. The assump- 
tion that only one ear at a time can have access to language areas of the brain 
underlies much theorizing about the ear effect and lag effect. For example, 
Kimura (1961, 1967) attributed the right-ear advantage for dichotically presented 
verbal stimuli to competition between inputs for access to language processing 
areas of the left hemisphere; the right ear was thought to win the competition 
because of the greater strength of its neuroanatomical connections to the left 
hemisphere. The explanation of the lag effect offered by Studdert-Kennedy et al.
(1970) also assumes that only one ear at a time can be admitted to the speech
processor. They proposed that the lag effect occurs because the leading sylla- 
ble is ejected from the processor when the lagging syllable arrives on the oppo- 
site channel. It will be argued in the discussion that follows that recent data 
do not support the idea that the ears must compete for entry to the speech pro- 
cessor. An attempt will be made to redefine the notion of dichotic competition 
and to account for the ear effect and lag effect in light of that definition. 

The view that dichotic stimuli compete for perceptual processing was elaborated
by Broadbent (1958) in his "filter theory" of selective attention. According
to that theory, the flow of sensory data to central processing areas is regulated
by a filtering mechanism which blocks peripherally all irrelevant sensory
channels. Broadbent treated the two ears as separate sensory channels. He con- 
sidered that in dichotic listening only one ear could be processed at a time and 
that the system would have to "switch channels" in order to process both ears. 
Peripheral blocking of one ear and channel switching were assumed to be under a 
subject's voluntary control. 

A motivation for the filter theory was to explain the finding of Cherry 
(1953) that when different continuous messages were presented to opposite ears
listeners could easily repeat back (shadow) the message at one ear and ignore the
other. Moreover, they showed no retention of the unattended message. Broadbent
attributed the shadowing results to the filtering out of the unattended message
so that only the attended message could have access to linguistic processing.
However, in subsequent research using the shadowing technique the unattended ear
was not fully suppressed. For example, Treisman (1960) found that words in the
unattended messages would occasionally be repeated by the shadower if those words 
were semantically probable within the context of the attended message. Also, if 
the experimenter switched messages between ears, subjects would sometimes also 
switch ears unconsciously, maintaining the continuity of the shadowed message. 
Such intrusions from the unattended ear relating to the semantics of the messages 
indicate that the unattended message must have been analyzed linguistically, and 
subjects' failure to retain the content of the unattended message in these tasks 
must be attributed to memory rather than perception. 

Ear monitoring experiments with dichotic syllables prove even more strikingly 
that listeners cannot voluntarily select a particular ear for perceptual analysis 
and filter out the other. When stop-vowel syllables are presented dichotically 
with simultaneous onsets, subjects frequently confuse ears when instructed to 
report a particular ear (Kirstein and Shankweiler, 1969; Halwes, 1969) and when 
instructed to report both syllables by ear (Gerber and Goldman, 1971). The pre- 
sent results expand these earlier findings by showing that confusions between ears 
persist even when syllable onsets are not precisely simultaneous. Accuracy of 
monitoring is, however, related to interaural delay interval, and the fact that 
intrusion frequency depends on the timing relation between ears is strong evidence 
that listeners cannot voluntarily exclude a particular ear from processing. The 
results suggest, rather, that selection from a particular ear depends on the 
physical relation between the dichotic stimuli. The ear monitoring results con- 
tradict two assumptions of the filter theory: first, that dichotic inputs are 
always strongly "tagged" by ear of origin, and second, that listeners can volun- 
tarily turn off an ear in dichotic tasks. 

If the unattended ear cannot be voluntarily inhibited peripherally, then how
are we to account for the ease of selectively shadowing a particular ear for
continuous dichotic messages? Moray (1969) considered this matter and concluded
that the mechanism of selection of a particular ear is the same in dichotic
listening as in ordinary binaural situations where many messages are arriving
simultaneously to both ears (the so-called "cocktail party" effect). A particular
message can be selected for attention providing there are sufficient physical
cues to establish a distinct spatial origin for the message. In everyday
listening situations the two ears do not behave as independent channels. Ordinarily
the same signals arrive at both ears, perhaps with slight differences in time, 
phase, or intensity between ears; these interaural differences provide the physi- 
cal cues for spatial localization of the sound. Thus, normally inputs at the two 
ears are integrated to yield a unitary percept and are compared to locate the 
sound source (Cherry, 1961). Moray proposed that dichotic inputs are handled in 
essentially the same manner. 

If we assume, following Moray, that in dichotic listening the auditory system
compares stimuli from the two ears and locates the source based on differences
between ears, we can then readily understand why reporting stop consonants
from a particular ear is difficult, while shadowing continuous messages is much 
easier. For ongoing messages the acoustic signals at opposite ears would be 
generally quite distinct at any point in time, so that by comparing the two inputs 
the auditory system could establish two distinct sound sources and a subject could 
shadow the message emanating from a particular location. In stop-vowel syllables 
the acoustic information distinguishing one stop from another is contained within 
the first 70 msec or less of any syllable, and the vowels are acoustically iden- 
tical at the two ears. Thus, if selection between ears presupposes a clear acous- 
tic distinction between simultaneous dichotic signals, it is understandable that 
selection would be faulty for CV syllables. The role of acoustic similarity be- 
tween ears in selective listening for dichotic stop consonants was demonstrated 
convincingly in an experiment by Halwes (1969). He presented CV syllables dicho- 
tically with syllable onsets precisely aligned for simultaneity and compared 
accuracy of selective report for syllable pairs which shared fundamental frequency 
at the two ears or which differed in fundamental frequency. When fundamental 
frequency was shared, listeners were unable to distinguish the attended from the 
unattended ear; performance was significantly improved when fundamental frequen- 
cies varied. The present study shows that another physical dimension, interaural 
asynchrony of onsets, is also important in selection. 

Confusion between ears for dichotic stops was attributed by Halwes to an 
acoustic fusion effect. Halwes claimed that the perceived localization of a syl- 
lable presented to one ear was shifted toward the midline when an acoustically 
similar syllable was delivered at the opposite ear. Often the listeners heard a 
single syllable localized at the midline or diffusely rather than at a particular 
ear. Earlier, a similar phenomenon was described by Broadbent and Ladefoged 
(1957). They presented the first formant of a synthetic vowel to one ear and the 
second formant to the opposite ear; these fused perceptually into a single vowel. 

While listening to the tapes for the present experiments, I observed fre- 
quent fusion for short delay (10-30 msec) trials. For these, generally only one 
of the two stops could be heard, and it often could not be definitely assigned to 
either ear. For longer delays, two stimuli could usually be detected, although
often the identity of one was still unclear. These observations accord with the
experimental data. On the two-response task, for example, the subjects could
identify one stop from each pair, regardless of delay, but correct identification
of both stops was facilitated by longer offsets. For ear monitoring, longer
delays reduced intrusions; this effect can be explained by the assumption that 
selection can occur only under conditions where both syllables can be detected. 

A somewhat surprising result in the ear monitoring task was that while in- 
trusion errors decreased with longer interaural delay, nonintrusion errors in- 
creased. The increase in nonintrusion errors with longer delays may result from 
the fact that at the shortest delays one nearly always hears one of the two syl- 
lables clearly, although that might be the unattended syllable. At longer delays 
listeners would be better at discriminating the attended from the unattended syl- 
lable, but they would often be unable to identify the attended syllable. That 
is, the increase in nonintrusion errors with longer delay can be related to the 
temporal alignment condition where the attended syllable can be selected but not 
identified. 

Dichotic fusion phenomena are of interest because they support the hypothe- 
sis that there is perceptual integration of dichotic stimuli. Cutting (1972) 
proposed that dichotic fusion can occur at various levels of perceptual process- 
ing. He considered the effects described by Halwes (1969) and by Broadbent and 
Ladefoged (1957) to be low-level or auditory fusions because both effects depend 
on a purely acoustic property of the stimuli: the fusion is disrupted if funda- 
mental frequency is varied between ears. Other types of fusion have been dis- 
covered in recent experiments where the integration apparently arises at higher 
perceptual levels and does not depend so critically on acoustic parameters. An 
example of a higher-level fusion is the phenomenon of "feature blending"
(Studdert-Kennedy and Shankweiler, 1970). If stop consonants contrast between 
ears in two distinctive features, place of articulation and voicing, many re- 
sponses are "blends" where the voicing feature at one ear is combined with the 
place feature at the other. Studdert-Kennedy, Shankweiler, and Pisoni (1972) 
found no change in frequency of blend responses whether vowels are shared be- 
tween ears, as in /pi/-/di/, or whether vowels vary, as in /pi/-/du/; based on 
this result they argued that left- and right-ear syllables are not blended acous- 
tically, but that abstract phonetic features become mixed between ears in the 
course of phonemic identification. Switching of stimulus elements between ears 
was also reported by Treisman (1970) for dichotic consonant-vowel-consonant syl- 
lables, differing between ears in all three phonemes; responses often combined 
phonemes from opposite ears, for example, "taz" + "geb" → "teb." A final type
of dichotic fusion to be considered is the combination of simultaneously pre- 
sented dichotic consonants to yield a perceived sequence of consonants (Day, 
1968). Day presented a word beginning with a stop consonant to one ear and a 
word beginning with a liquid to the opposite. For example, one ear received 
"lack" and the other "back." Many subjects (about half) reported hearing a 
single word beginning with a stop-liquid cluster, "black;" the remaining subjects 
heard one or both of the actual stimulus items. This effect has been termed 
"phonological fusion" because the structure of the response is apparently deter- 
mined by a rule of English phonology which permits stop + liquid clusters sylla- 
ble initially but prohibits liquid + stop. Phonological fusion occurs with CV 
syllables as well as with more complex words (Cutting, 1973). 

The existence of perceptual fusions such as feature blending, phoneme 
switching between ears, and phonological fusion argues strongly against the 
hypothesis that only one ear at a time has access to the speech processor. For 
fusion to occur, stimuli to both ears must enter the processor and undergo
phonetic analysis within a single "time frame." Moreover, it is interesting that
feature blending was observed originally in a task giving a highly reliable 
right-ear effect. Apparently, the same stimulus material can give rise to dif- 
ferent types of perceptual effects — either fusion effects (where the response 
contains parts of both stimuli) or suppression effects (where only one of the two 
stimuli is correctly reported). If there is a mechanism for preventing an
overload of sensory data into the speech processor, it is unlikely that such a device
would work so sporadically, producing sometimes suppression and sometimes fusion. 
It is argued here that the right-ear effect, lag effect, and fusion effects all 
represent outcomes of normal speech perception strategies applied to stimuli 
arriving at the speech processor from both ears. 

A significant observation in connection with the lag effect is that fusion 
responses for stop-liquid pairs occur frequently even when the stop and liquid 
are temporally offset between ears by 50-100 msec (Day, 1970; Cutting, 1973). 
Day found a constant fusion rate whether stop or liquid led and regardless of 
length of offset. In Cutting's work, stop + liquid cluster responses were more 
common when the stop led, but still occurred frequently when the liquid led. The 
occurrence of fusion for stop-liquid pairs contradicts the notion that the 
arrival of a delayed syllable automatically causes the leading syllable to be 
ejected from the speech processor. The fusion results suggest that if the lag- 
ging syllable causes interruption of the processing of the leading syllable, it 
does so only after the phonetic class of the lagging syllable (stop or liquid) 
has been determined. We may infer that both lagging and leading sounds undergo 
at least rudimentary phonetic analysis. 

Why would the lag effect and ear effect occur regularly for stop-stop and 
liquid-liquid pairs but not for liquid-stop or stop-liquid? Two explanations 
seem reasonable. Perhaps the notion "perceptual processor" has been too broadly 
defined. If there are separate processors or "feature detectors" for stops and 
for liquids, then a stop-liquid pairing may not constitute a condition of per- 
ceptual competition. The second approach is to view the situation linguistically. 
In English phonology stop-liquid consonant clusters are permitted syllable ini- 
tially, while liquid-stop, stop-stop, and liquid-liquid are prohibited sequences. 
If the speech processor were attempting to integrate the two inputs into a single 
syllable, a "correct" response would be available only for stop-liquid pairs. 
Liquid-stop pairs could be construed as stop-liquid, but for liquid-liquid or 
stop-stop there is no possible response which could integrate the two ears. 

Thus, even if inputs from both ears are admitted to the speech processor, 
there are still at least two levels of processing at which perceptual competition 
could arise. First, there is the level of "feature extraction," where distinctive
phonetic features like voicing, place of articulation, nasality, etc. are identi- 
fied. Second, there is a higher level where the percept, consisting of a sequence 
of segmental phonemes, is formed. The phonetic features extracted from the 
speech signal provide only part of the relevant information for the decisions 
made at this second stage. Here the phonological rules of the language play a 
role. For example, nasality occurring during a vowel would be assigned by the 
phonological rules of English to the following consonant, but for French, nasal- 
ity would be assigned to the vowel itself. Also, in running speech, not all 
words are clearly articulated, and context facilitates understanding. The influ- 
ence of context on speech perception may operate at this stage. 

It is proposed here that the right-ear effect arises at the feature extraction
level of processing, but that the lag effect and perceptual fusions both
arise at the phonemic decision stage. For this model to work, we must assume 
that the feature extraction stage involves an independent analysis of acoustic 
cues for both values of a binary feature. That is, rather than being either 
voiced or voiceless, the output of voicing analysis for English stops might be 
both voiced and voiceless, providing there were sufficient cues (aspiration, 
VOT, f0 contour, etc.) to support both analyses. The output from feature
extraction would include a weighting or probability based on the number of acoustic
cues consistent with each feature value. For example, on a dichotic trial with 
a voicing contrast between ears (+) voicing could have a .80 weighting and (-) 
voicing a .50 weighting. Under binaural conditions (i.e., without competition) 
the analysis would favor a particular value more strongly. This approach is 
consistent with the finding in dichotic studies that listeners often do identify 
stops at both ears correctly, and that listeners rarely misperceive feature 
values which are shared between ears (Kirstein and Shankweiler, unpublished
data). This view is consistent with the importance of multiple acoustic cues in
perception of many distinctive features and with the context sensitive nature of 
the cues (e.g., VOT for voiceless stops varies with place of articulation). 
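
The weighting scheme described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the weight for each value of a binary feature is the fraction of its potential acoustic cues that were detected; the text says only that the weighting is based on the number of consistent cues. The function name and the cue counts are hypothetical.

```python
def feature_weights(plus_found, plus_possible, minus_found, minus_possible):
    """Weight each value of a binary feature independently.

    Assumption (not from the text): weight = cues detected / cues that
    could support that value. The two weights need not sum to 1, because
    the (+) and (-) analyses are carried out independently.
    """
    return {
        "+": plus_found / plus_possible,
        "-": minus_found / minus_possible,
    }


# Dichotic trial with a voicing contrast between ears: cues for both
# values are present, so both analyses receive substantial weight.
dichotic = feature_weights(plus_found=4, plus_possible=5,
                           minus_found=2, minus_possible=4)

# Binaural trial (no competition): the cues favor one value strongly.
binaural = feature_weights(plus_found=5, plus_possible=5,
                           minus_found=0, minus_possible=4)
```

With these made-up counts the dichotic trial yields weights of .80 and .50, as in the example in the text, while the binaural trial yields 1.0 and 0.0.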

In their model of the ear effect, Studdert-Kennedy and Shankweiler (1970) 
attributed the right-ear advantage to degradation in the auditory representation 
of left-ear syllables due to their more circuitous neural connections to the left 
hemisphere. They considered that ipsilateral connections were inhibited in
dichotic presentation so that left-ear stimuli could reach speech processing
areas in the left cerebral hemisphere only by first traveling to the right hemi- 
sphere and then crossing to the left via the corpus callosum. If the right ear 
provides superior sensory data to the speech processor, this would, in the pre- 
sent model, produce higher weightings for features extracted from right-ear syl- 
lables. However, there might conceivably be reasons for weighting the right ear 
more highly than the left besides the quality of their auditory representation. 
The model claims simply that both ears are admitted to the speech processor but 
that the syllables originally presented at the right ear emerge from the feature 
extraction stage of processing with higher weighting than left-ear syllables. 
This idea is consistent with the finding that stop consonants contrasting between 
ears in only voicing or place of articulation give smaller ear effects than con- 
sonants contrasting in both features (Shankweiler and Studdert-Kennedy, 1967; 
Studdert-Kennedy and Shankweiler, 1970). 

The output from the feature extraction level is also the input to a higher 
level of processing where the identity and order of segmental phonemes are decided. 
Because the input to this stage may contain incompatible feature values (e.g., 
voiced and voiceless, labial and velar), there must be perceptual strategies
available to resolve conflicting analyses. It is proposed that the lag effect 
reflects one such strategy and that feature blending and phonological fusion re- 
flect others. The blending in the response of place and voicing from opposite 
ears could be a function of weights assigned during the feature analysis. The 
highest voicing and place values may have been originally from opposite ears. 
This account agrees with Studdert-Kennedy and Shankweiler (1970) in considering 
the locus of feature blending to be at a stage of processing subsequent to fea- 
ture extraction where features are combined to yield segmental phonemes. 

How many distinct syllables the listener can hear for a dichotic pair prob- 
ably depends on acoustic fusion as well as on the phonological analysis. The 
auditory system might recognize two sound sources or only one, and this conclusion
could in turn affect the number of phonemic solutions produced by the phonological
analysis. A striking result in dichotic experiments is the wide variation
among individuals in frequency of fusion responses for stop-liquid pairs
(Day, 1970) and in their accuracy of selecting a particular ear in ear monitoring
tasks. Such individual differences might arise in the auditory analysis of
spatial localization for stimuli which are physically similar but not identical. 

During the phonological analysis a number of possible phonemic solutions 
might be rejected; the listeners would "hear" only those that are finally 
accepted. Solutions involving low-weighted features would probably be rejected. 
It is suggested that the lag effect occurs because, as a perceptual strategy, leads
tend to be rejected more often than lags. That is, if the feature extraction produces
first one feature value and then another, incompatible with the first, the pho- 
nological analysis may strongly favor the second result. The second feature 
value might be treated as if it were a revision of the first. The claim is being 
made that both the lagging and leading syllables undergo phonetic analysis but 
that the leading syllable is subsequently rejected. This would occur for stop-stop
pairs or for liquid-liquid pairs. For stop-liquid pairs the leading syllable
would not be rejected because stop-liquid is an acceptable phonemic sequence.
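
The decision strategy just described can be sketched as a small rule. This is a deliberately deterministic simplification: the text describes a tendency to reject leads, not an absolute rule, and the consonant sets and the function names below are hypothetical choices made for illustration.

```python
STOPS = {"b", "d", "g", "p", "t", "k"}
LIQUIDS = {"l", "r"}


def manner(consonant):
    """Classify a consonant as 'stop' or 'liquid' (the only classes modeled)."""
    if consonant in STOPS:
        return "stop"
    if consonant in LIQUIDS:
        return "liquid"
    raise ValueError(f"unmodeled consonant: {consonant!r}")


def phonemic_solution(lead, lag):
    """Return the consonant(s) accepted for a dichotic pair.

    lead, lag: initial consonants of the leading and lagging syllables.
    """
    lead_class, lag_class = manner(lead), manner(lag)
    if lead_class == "stop" and lag_class == "liquid":
        return [lead, lag]   # permissible cluster: both are kept
    if lead_class == "liquid" and lag_class == "stop":
        return [lag, lead]   # construed as a stop-liquid cluster
    return [lag]             # same class: the lag is treated as a revision
```

Under this sketch a stop-stop pair such as /b/ leading /g/ yields only the lagging /g/, while /b/ paired with /l/ yields the cluster /bl/ whichever syllable leads.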

It is argued that the lag effect and right-ear effect both involve speech 
perception, but that within the course of speech processing the two effects are
independent. Moreover, it is claimed that the ear effect arises at a level of
processing prior to the lag effect. Is there evidence in the present data to
support these hypotheses? First, there is the fact that the lag effect and ear 
effect have completely different temporal parameters: the ear effect is greatest 
with the shortest offsets while the lag effect increases with longer delays up to 
60 msec. Beyond the different time functions for the two effects, there are 
other aspects of the data which suggest that the effects arise at different
levels.

At any delay, the size of the ear advantage was the same whether measured on
lags (as percent correct on right-ear lags minus left-ear lags) or measured on leads (as
percent correct on right-ear leads minus left-ear leads). It is puzzling that the size of
the ear effect should be the same on lags and leads because the conditions of 
competition between ears are quite different in the two cases. The onset of a 
lagging syllable always coincides with some portion of an ongoing stimulus at the 
other ear but the onset of a leading syllable coincides with silence in the other 
ear. Since it is clear from the data that the ear effect is enhanced by simul- 
taneous competition, we would expect the ear difference in lags to be much 
greater than in leads. That is, left-ear leads should suffer only slightly in 
comparison with right-ear leads, but for lags the left ear should be much poorer
than the right ear.

For the two-response and temporal order tasks there was no interaction between
size of ear effect and lag vs. lead. In ear monitoring there was a significantly
smaller ear effect in leads than in lags. However, the analysis presented
here was based on correct responses only. In intrusion responses from
the unattended ear, the right-ear effect was greater in lags than leads, so if
all ear monitoring responses are considered, there is no interaction between ear
effect and lag effect.

Interestingly, the model just proposed provides a way of
accounting for the data. Suppose that the ear effect arises at the feature 
extraction level of processing and that at this level the ear difference is only 
on lags: features extracted from left-ear lags have received lower weights than 
those extracted from right-ear lags. At the phonological decision stage there 
is a tendency to favor lags over leads. The asymmetry between ears arising 
initially in lags would then be mirrored in leads as well. Right-ear leads are
always paired with left-ear lags; these would be rejected less frequently as a 
phonemic solution than left-ear leads, which are paired with right-ear lags. 
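
A toy calculation may make the mirroring argument concrete. All numbers below are hypothetical, as are the names and the ratio rule used to settle the competition; the sketch also assumes, for simplicity, that exactly one member of each pair is accepted, although listeners often report both syllables correctly.

```python
# All numbers are hypothetical and chosen only to show the pattern.
LEAD_WEIGHT = 0.7                          # same for both ears
LAG_WEIGHT = {"right": 0.8, "left": 0.6}   # ear difference on lags only
LAG_BIAS = 2.0                             # decision stage favors lags


def other_ear(ear):
    """Return the ear opposite to the one given."""
    return "left" if ear == "right" else "right"


def p_lead_accepted(lead_ear):
    """Chance that the lead survives competition with its paired lag."""
    lag_strength = LAG_BIAS * LAG_WEIGHT[other_ear(lead_ear)]
    return LEAD_WEIGHT / (LEAD_WEIGHT + lag_strength)


def p_lag_accepted(lag_ear):
    """Chance that the lag survives; its competitor is the other ear's lead."""
    return 1.0 - p_lead_accepted(other_ear(lag_ear))
```

In this sketch lags are accepted more often than leads, right-ear leads fare better than left-ear leads even though leads are weighted equally at the two ears, and the right-ear advantage comes out the same size on lags and on leads.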

The relation between the right-ear effect and lag effect has always seemed 
paradoxical because all models of the ear effect imply a temporal advantage for 
the right ear. This is supported by Springer (1971), who reported a 50 msec 
right-ear advantage in reaction time for correctly identified dichotic stops. 
The procedure of temporally offsetting stops at the two ears was proposed origi- 
nally as a possible method of measuring the right-ear effect (Studdert-Kennedy 
et al., 1970; Berlin et al., 1973). It was thought that giving the left ear a 
certain lead time would make left- and right-ear stops equally intelligible. 
The finding that left-ear lags wash out the ear advantage was, thus, incompatible 
with our understanding of the ear effect. The present view in which the lag 
effect and ear effect arise at different levels of perceptual processing offers 
a solution to this paradox. In this model only left-ear lags suffer from the 
laterality effect, not left-ear leads. Thus, the use of a left-ear lead to com- 
pensate for the right-ear advantage would be effective at the lower level of 
processing but would be wiped out at the higher level. By assigning the lag 
effect to a higher level of perceptual analysis than the ear effect, we leave 
open the possibility that the lag effect may not be specific to speech perception. 
The lag effect may reflect a general strategy in various sense modalities for 
handling incompatible inputs in pattern perception. 

State-of-the-Art Report on Language Processing*
A. M. Liberman^

Haskins Laboratories, New Haven, Conn.



To provide a framework for our discussion, I will set down in outline form 
the questions that arise in my mind when I wonder how we might get language into 
the hearing-impaired child. Some of these questions were raised by Ira Hirsh in
his keynote speech the other evening, which reinforces my belief that they will 
help to organize our discussion this morning. 

I should confess at the outset that I know very little about deaf children, 
even less, indeed, than I know about language processing. It is the more appro- 
priate, therefore, that I should try to make the outline neutral. But I do have 
views (even biases) that may, in one way or another, influence what I say, so, 
before proceeding with the outline, I should get them on the record. The most 
relevant of these concern the function or purpose of grammar. My colleagues and 
I have written about those views in other places, and at length (Liberman, 
Mattingly, and Turvey, 1972; Mattingly, 1972a; Liberman, 1973); if only for that
reason, I should be as brief as possible. 

*Edited transcript of a talk presented at a workshop on Sensory Capabilities of
Hearing-Impaired Children held October 26 and 27, 1972, in Baltimore, Md.; to be
published in a book of the same title, edited by Rachel E. Stark (Baltimore, Md.:
University Park Press, in press).

^Also University of Connecticut, Storrs, and Yale University, New Haven, Conn.

I believe that grammar (or, more exactly, grammatical recoding) serves to
reshape information to enable the speaker-listener to move it efficiently between
a nonlinguistic intellect, which has facilities for the processing and long-term
storage of ideas, and a transmission system, where sounds are produced and perceived.
Without the grammatical reshaping that comes so naturally to all normal
human beings, we should have to communicate our ideas by means of sounds that
were uniquely and holistically different from each other: one sound pattern, however
simple or complex, for each idea. In that case, the number of ideas we
could transmit would be limited to the number of sounds we can produce and identify.
(Precisely that limitation applies to the normal communication of nonhuman
animals if, indeed, it is true that those creatures lack the capacity for grammatical
coding.) We do not know exactly how many messages could be transmitted
by that kind of "language." But, given the richness of the intellect and the
comparative poverty of the transmission system, the scope of such a nongrammatical
"language" is orders of magnitude less than that which is afforded by the
grammars that are so readily available to human beings and that, in my view, set
language apart from other perceptual and cognitive processes (Liberman et al., 
1972; Liberman, 1973). All this is to imply what some of my colleagues and I 
believe about the biology of grammar — that the capacity for grammatical process- 
ing evolved as a kind of interface, matching the output of the intellect to the 
vocal tract and the ear. If that is so, the biological development of those 
grammatical processes should have been influenced by the possibilities and limi- 
tations of the mismatched structures they connect. Natural grammatical processes 
should, then, reflect those influences. Unfortunately, we do not know how far 
"up" and "down" the grammatical interface the effects of intellect and transmis- 
sion system go. I have some guesses, based on the formal resemblances between 
speech and the rest of grammar (Mattingly and Liberman, 1969), but I see no point
in inflicting them on you. I will, however, suggest that the point of view I 
have expressed here is relevant to our concerns. When congenital deafness pre- 
vents the use of the normal transmission system, what are the consequences for 
grammar? If the deaf child bypasses speech entirely, as in the case of natural
sign, must he then use a grammar different from the grammar of spoken languages? 
If so, what are the differences and, more to the point, how far "up" the system 
do they extend? If the grammar of spoken languages is not appropriate for the 
transmission system used in sign, how adequately can the signer adapt it? Or 
must he contrive a more suitable grammar? If so, how well does this more suit- 
able grammar work, and at what cost in effort? 

My views about grammatical recoding take a more specific form the more 
closely we approach speech at the transmission end of the system. They also
become more directly relevant to our concern, since we must surely try to under- 
stand speech if we want to know how the deaf child might cope with it. Thus, I 
think we should want to understand the function of the phonetic representation 
so that we can better appreciate the consequences, if any, of the failure to 
develop it in a proper way. We should also want to understand the relation be- 
tween the phonetic representation and the sound, for we cannot otherwise see how 
the deaf child might use prosthetic devices that drastically alter the acoustic 
signal or, in the extreme case, transform it for delivery to the eye or the skin. 
(See: Liberman, Cooper, and Studdert-Kennedy, 1968.) 

Let us consider first the function of the phonetic representation in the 
conversion between ideas and sounds. We do not know the shape of ideas in the 
intellect, but we should doubt that they are strung out in time as they are after 
they have been transformed into sound. If that doubt is well founded, we should 
suppose that the meaning of the longer segments of language (for example, sen- 
tences) must transcend the meaning of the shorter segments (for example, words) 
that they comprise. There is, then, a need for a buffer in which the shorter 
segments can be held until the meaning of the longer segments has been extracted. 
I suspect that the universal phonetic features became specialized in the evolu- 
tion of language as appropriate physiological units for storage and processing in 
that short-term buffer. (See: Liberman et al., 1972.) Since the substitution 
of sign for speech does not remove the need to spread ideas in time, we must 
wonder how or how well the need for short-term storage is met. We should also 
wonder what happens when, instead of bypassing the normal transmission system 
entirely, as in natural sign, one rather enters directly (if only approximately)
at the level of the phonetic representation by finger spelling or by writing 
(and reading) the letters of the alphabet. For if those substitute signals do 
not engage the phonetic features, then the deaf child may have to make do with 
other representations (for example, visual or kinesthetic images). How efficient
are these nonphonetic representations for the storage and processing in short- 
term memory that the perception of language may be presumed to require? (See, 
for example, Conrad, 1972.) 

But suppose that instead of avoiding the sound by representing the phonetic 
message directly (as in finger spelling or alphabetic writing), we try, as many
have, to present the acoustic signal to the deaf child in a form (for example, 
spectrographic) suitable for presentation to an organ other than the ear. One 
of the problems we will then encounter arises from the nature of the relation between
the sound and the phonetic message. A great deal of evidence supports the
conclusion that speech is not an alphabet or simple substitution cipher on the 
phonetic representation, but rather a complex and grammatical code (Liberman, 
Cooper, Shankweiler, and Studdert-Kennedy, 1967; Liberman et al., 1968). Indeed, 
if speech were not a complex code it could not work efficiently, for just as the 
transmission system is not well matched, most broadly, to the intellect, so also 
is it unable, more narrowly, to deal directly with the phonetic representation. 
Thus, the rate at which the phonetic message is (or can be) communicated — up to 
20 or 30 phonetic segments per second — would far exceed the temporal resolving 
power of the ear if, as in a simple cipher, each phonetic segment were repre- 
sented by a unit sound. But there is another, equally important problem we 
should expect to have if the phonetic representation were transmitted alphabeti- 
cally: the listener would have great difficulty identifying the order of the 
segments. Though little is known about the ability of the ear to identify the 
order of discrete (nonspeech) sound segments, recent work suggests that it fails 
to meet the normal requirements of speech by a factor of five or more (Warren, 
Obusek, Farmer, and Warren, 1969). That is, when segments of distinctive non- 
speech sounds are arranged in strings of three or four, their order can be 
correctly identified only when the duration of each segment is five or more 
times longer than speech sounds normally are. 
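
The arithmetic behind these two limits can be made explicit with a short sketch. It takes the upper-limit rates quoted above (20 to 30 segments per second) and applies the factor of five as a lower bound; applying the factor to those upper-limit rates, rather than to typical speech-sound durations, is an assumption made here for illustration, and the function names are hypothetical.

```python
def segment_duration_ms(segments_per_second):
    """Average time available per phonetic segment, in milliseconds."""
    return 1000.0 / segments_per_second


def duration_needed_for_order_ms(segments_per_second, factor=5):
    """Segment duration at which order could be identified, taking the
    'factor of five' from the text as a lower bound."""
    return factor * segment_duration_ms(segments_per_second)


for rate in (20, 30):
    print(rate,
          round(segment_duration_ms(rate), 1),
          round(duration_needed_for_order_ms(rate), 1))
```

At 20 segments per second each segment has 50 msec, and order identification would call for roughly 250 msec or more per segment; at 30 per second the figures are about 33 and 167 msec.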

The complex speech code is a grammatical conversion that nicely evades both 
those limitations of auditory perception: several segments of the phonetic 
message are commonly folded into a single segment of sound, which takes care of 
the problem posed by the temporal resolving power of the ear; and there are con- 
text-conditioned variations in the shape of the acoustic cues that provide non- 
temporal information about the order of segments in the phonetic message, thus 
getting around the ear's relatively poor ability to deal with order on a tem- 
poral basis. (See: Liberman et al., 1967; Liberman et al., 1972.) But for our 
purposes the important point is that these gains are achieved at the cost of a 
very complex relation between phonetic message and acoustic signal. We are not 
normally aware of how complex this relation is because the decoding is ordinarily 
done by an appropriately complex decoder that speech has easy access to. Unfor- 
tunately for the needs of the deaf child, however, that decoder is connected to 
the auditory system. What happens, then, when we try to present the raw (that 
is, undecoded) speech signal to some other sense organ, such as the eye? On the 
basis of what we know about speech we can, I think, understand some of the dif- 
ficulties that are encountered; we can also, perhaps, see opportunities that 
might be exploited. 

I should like to turn now from a more specific concern with grammatical pro- 
cesses near the transmission end of the system to consider some hypotheses about 
language that deal with grammatical processes more generally. In speaking of the 
function of these processes, I have suggested that by appropriately interfacing
mismatched structures of intellect and transmission, grammar makes possible the 
efficient communication of ideas from one person to another. But I believe that 
an equally important function of a grammar is to enlarge the possibilities for 
communicating ideas to oneself. By getting ideas out of the inarticulate intel- 
lect and down at least part way into the language system, we conceivably achieve 
a kind of control that we could not otherwise have managed. If so, having a 
grammar confers on us much the same kind of advantage that a mathematics does. 
A significant part of normal human cognitive work may then depend in one way or 
another on grammatical processes. In that case we have reason to be concerned 
about the consequences that may follow when these processes are tampered with. 

I have also spoken of the human intellect as though it were in no sense 
linguistic (that is, as if all the accommodating to the transmission system had
been done by the development of the grammatical interface). That leaves out of
account the possibility that in the evolution of language the intellect and the 
transmission system themselves underwent alterations that tended to reduce the 
mismatch. In the case of the vocal tract, indeed, there is evidence that such an 
accommodation did occur. The vocal tract of human beings is different from that 
of other primates (Lieberman, 1968; Lieberman, Klatt, and Wilson, 1969; Lieberman,
Crelin, and Klatt, 1972), and the difference appears to have produced for us a
greater ability to transmit the phonetic message, thus easing somewhat the job 
that the speech grammar has to do. But what of the other end of the system? Was 
the originally nonlinguistic intellect also altered in the direction of a better 
fit to the other structures in the linguistic system? We do not know, of course, 
but if it was, then we should have to suppose that the human intellect is to some 
extent specifically adapted to normal grammatical processes. Given that possi- 
bility, we have another reason for wondering whether alteration of normal gram- 
matical processes might have consequences for intellectual ability. 

Throughout this introduction I have spoken of "natural" grammatical recod- 
ings, which implies a bias I particularly want to get on the record — namely, that 
such recodings are not arbitrary inventions or cultural artifacts, but rather the 
reflections of deeply biological processes. I believe, as do many other people 
who concern themselves with language, that human beings come equipped with the 
capacity to develop grammars, including, as I have already emphasized, the gram- 
mar of speech that connects the phonetic message to the acoustic signal. To the 
extent that we force these processes into unnatural channels, we can expect to 
encounter difficulties. Unnatural grammars will very likely be hard to learn,
especially if they are as complex as they may need to be. Indeed, the fact that 
people do not learn to read spectrograms suggests that we cannot, by learning, 
acquire a grammar of speech or make the natural grammar work with an organ other 
than the ear (Liberman et al., 1968). 

Now let us turn to the outline I spoke of at the beginning, the one that 
might help us to organize our discussion. Though the shape of the outline con- 
forms rather well to the views I have just talked about, the outline itself does 
not prejudge any of the issues it raises, or so I hope. The larger division in 
the outline is between those methods that would aim at delivering to the hearing- 
impaired child as close an approximation to the spoken language as possible, and 
those that would use a different transmission system, such as, for example, the 
gestures of sign. The first method is further divided between those presenting 
speech in undecoded form (that is, as a signal from which the phonetic message
has not been extracted) and those presenting it in decoded form (that is, for
example, as a phonemic or phonetic transcription). With undecoded speech there 
is, of course, an additional, subordinate choice among modalities: do we pre- 
sent the signal to the ear, the eye, or the skin? 

I. COMMUNICATION OF A STANDARD, SPOKEN LANGUAGE

It seems reasonably obvious that we should want, if possible, to develop in 
the deaf child a reasonable approximation to standard, spoken language. Because 
the greatest number of natural grammatical processes is then used, the fullest 
possible development of language becomes a relatively easy matter, and there is 
the least risk of crippling the kinds of cognitive processes that normal grammatical
processes ordinarily serve. Those advantages are, of course, in addition to
giving the child access to standard literature of all kinds and the ability to 
communicate more readily with normal-hearing people. I do not mean to propose 
that we eschew all other possibilities, since the advantages of trying to give 
the child an approximation to a standard language can be outweighed by many con- 
siderations. Indeed, I do not mean to propose anything here, but only to frame 
the possibilities. 

A. Transmission of the Undecoded Speech Signal 

I said in my introductory remarks that there is a complexly encoded relation 
between the phonetic message and the acoustic signal. The salient characteristic 
of the speech code is that information about successive segments of the phonetic 
message is often transmitted simultaneously on the same parameter of the sound. 
As a consequence, there is, in general, no acoustic criterion by which one can 
identify segments of sound that correspond in size or number to the segments of 
the phonetic message, and the acoustic shape of the cues for a phonetic segment 
will often vary greatly as a function of context. The perception of speech requires,
then, a complex decoding operation. In this section we will consider
those ways of presenting speech, including even rather elaborately processed 
speech, in which, no matter how well the speech signal penetrates the person's 
deafness, the decoding job has yet to be done. But first, by way of introduction,
I should say more about the speech code and the speech signal. Thus, I should 
emphasize that the relation between phonetic message and sound is not always that 
of a complex grammatical code; there are, intermittently, quite transparent or 
unencoded stretches. In those parts of the speech signal that carry the phonetic 
message in encoded form, there is, as I have pointed out, the complication that 
information about more than one phonetic segment is carried simultaneously on 
the same acoustic parameter. In the transparent or unencoded stretches, however,
there is no such complication: a segment of sound carries information about only 
one phonetic segment. In slow articulation the vowels and fricatives, for 
example, are reasonably transparent, as are some aspects of the distinctions 
among phonetic manner classes. The fact that the phonetic message is sometimes 
encoded in the speech signal and sometimes not becomes important later in this 
section of the outline when we consider how to present the speech signal to an 
organ other than the ear. 

I should also emphasize here that there is an aspect of the speech signal 
that has, in principle, nothing to do with encodedness, but that nevertheless can 
make speech hard to deal with, especially for the deaf. I refer to the well- 
known fact that speech is, from an engineering standpoint, a very poor signal. 
The acoustic energy is not highly concentrated in the first two or three formants,
which carry most of the important linguistic information, but is rather
smeared broadly through the spectrum. Moreover, some of the most important 
acoustic cues are rapid frequency changes of the formants, the so-called for- 
mant transitions; such rapid frequency swings are, by their nature, physically 
indeterminate. In the processing we normally do in speech perception, therefore, 
we must not only decode the signal so as to recover the phonetic segments which
are so complexly encoded in it, but also, apparently, clean up the signal (track
the formants, as it were) and deliver to the decoder a clearer parametric description
of a still undecoded signal. I know of no evidence that human beings
have devices (shall we call them property filters?) to do that job. It is nonetheless
relevant to our concerns, however, to know that the linguistically important
acoustic cues are poorly represented, and to wonder, then, whether we
might help the deaf by altering speech to make it a better signal. 

1. Getting the undecoded speech signal in by ear. If we are to deal with
the undecoded speech signal, then we should want, if possible, to get it in by 
ear in order to take advantage of all the physiological equipment, including 
especially the speech decoder, that is naturally connected to the auditory sys- 
tem. But we must then alter the speech signal in some way that is calculated to 
evade the condition of deafness. The simplest and most common alteration is 
amplification. I will not discuss that remedy further, except to say the obvi- 
ous, that it does not always solve the problem. 

I would rather consider other, more complicated alterations in the speech 
signal. Here I have in mind that, as I said in the introduction to this section, 
the speech signal may be hard to deal with, not only because of its peculiarly 
complex relation to the phonetic message, but also because the important cues are
not among the most prominent features of the acoustic landscape. By using what 
we now know about those cues, and by taking advantage of the techniques that en- 
able us to manipulate them in convenient and flexible ways, I should think we 
might be able to make speech significantly more intelligible to the deaf. We 
should want first, for this and for other more general purposes, to extend our 
knowledge about the acoustic cues by discovering exactly which ones deaf people 
can and cannot hear. Then we should explore the possibility of producing a more 
effective signal by putting the acoustic energy where it counts, and by specifi- 
cally reinforcing certain cues. Of course, many of the alterations that might, 
on a common-sense basis, be expected to help could only be managed with totally 
synthetic speech, since it is beyond our present technological capabilities to 
process "real" speech so as to produce those patterns that are likely to prove most
effective. But it is nonetheless worthwhile, I think, to see how much better we 
can do with even the most extreme, synthetic departures from normal speech. We 
all know that what is technologically not feasible today is child's play tomor- 
row, so if we find that certain kinds of synthetic speech can be got through to 
the deaf better than natural speech, we can look forward realistically to the 
possibility of someday being able to produce such signals from "real" speech. 
But there might also be an immediate application. I have in mind the problems of 
the congenitally deaf child and the possibility that the development of his
linguistic system might be promoted — or, more exactly, not held up — if speech could
more effectively be got through to him. Of course, if we could provide him only 
with exposure to appropriately tailored synthetic speech, he could not interact 
with it in the normal way. Still, he might, like the chaffinch, gain something 
important if his normal language mechanisms had proper data to work on.

There are other possibilities for alterations in the speech signal that
might also increase intelligibility for the deaf. In that connection I should 
like to take particular note of some work done recently by Timothy Rand (1973). 
That work is the more relevant because a member of our conference, Dr. Pickett,
has results that are related to those of Rand, and Dr. Pickett will, I believe, 
describe those results for us at this session. Rand has found that when the 
formants are split between the ears the two higher formants are, to a signifi- 
cant extent, released from the masking effects of the lowest one. More specifi- 
cally, the procedure and the findings are as follows. Using synthetic speech to 
have the stimulus control he needs, Rand presents binaurally the syllables [ba],
[da], and [ga], which are distinguished only by the transitions of the second
and third formants. He then determines by what amount he must reduce the intensities
of the second and third formants to bring the subjects' accuracy of identification
down from nearly 100 percent, where it is before the intensity reduction,
to a level just slightly above chance. In another condition, he carries out ex- 
actly the same procedure, but this time with dichotic rather than binaural pre- 
sentation. In the dichotic condition the first formant is presented to one ear, 
the second and third formants to the other. The first thing to be said about 
the results is that, as had been known before, the listener fuses the two inputs 
quite readily and hears an intelligible utterance. But, for our purposes, the 
more important result is that, in order to produce a reduction in intelligibility 
equal to that of the binaural condition, Rand must, in the dichotic condition,
reduce the intensities of the second and third formants by an additional 15 db. 
That is, in the dichotic condition the transition cues for the stop consonants 
can, other things equal, be heard (and used) by the subjects at a level 15 db 
lower than that required in the normal binaural condition. Thus, it is as if the 
dichotic presentation produced a 15 db release from masking. I should emphasize 
that Rand's work has been done with normal-hearing subjects, and the degradation
in the speech has so far been only in the form of intensity reduction. Still, we 
might want to consider the implications that Rand's work could have for improving 
speech intelligibility with the deaf. Perhaps Dr. Pickett will do that. 
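
To put the 15 db figure in perspective, a level difference in decibels can be converted into a ratio of amplitudes or of powers. The short Python sketch below performs only that arithmetic; the function names are illustrative and are not part of Rand's procedure.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Convert a level difference in decibels to a ratio of amplitudes."""
    return 10 ** (db / 20)


def db_to_power_ratio(db: float) -> float:
    """Convert a level difference in decibels to a ratio of powers."""
    return 10 ** (db / 10)


release_db = 15.0  # the dichotic advantage reported in the text
amplitude_ratio = db_to_amplitude_ratio(release_db)
power_ratio = db_to_power_ratio(release_db)
print(f"amplitude ratio: {amplitude_ratio:.2f}")
print(f"power ratio: {power_ratio:.1f}")
```

On this arithmetic, a 15 db release means the second and third formants could be presented at a little under one-fifth of the amplitude needed in the binaural condition and still be identified at the same near-chance level.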

2. Getting the undecoded signal in through a nonauditory modality. Over
the years, and especially in the recent past, attempts have been made to help 
the deaf by presenting the speech signal to the eye or the skin. Those attempts 
were very adequately reviewed by Dr. Pickett at the 1970 meeting in Easton, 
Maryland. As our contribution to an earlier meeting at Gallaudet, Franklin 
Cooper, Michael Studdert-Kennedy, and I undertook to describe the difficulties
facing anyone who tries to decode the acoustic stream of speech without the aid 
of the physiological decoder that normally does it for him (Liberman et al., 
1968). Indeed, the source of those difficulties should be apparent on the basis 
of what I have said here today about the complexly encoded nature of the relation 
between the acoustic signal and the phonetic message. If the sounds of speech 
were an alphabet on the phones — that is, if there were a discrete acoustic seg- 
ment for each phonetic segment, or if the segments were merely linked as in 
cursive writing — then it should be no more difficult to read spectrograms than 
to read print. (Of course we should still have to contend with the fact
that the signal-to-noise ratio of speech would be poorer by far than that of print;
that would, however, pose no very serious problem.) But, as I have said already, 
the relation of the speech signal to the message it carries is not that simple. 
Though the speech code matches the requirements of the phonetic representation 
to the particular limitations of the transmission system, thus permitting these 
two structures to work well together, it does so at a price; to extract the
phonetic message from the acoustic signal requires a special and complex decoder.
Such a decoding mechanism is apparently quite readily available to all human 
beings, but, unfortunately for our present purposes, it is connected to the 
auditory system, and experience in trying to learn to read spectrograms suggests 
that it cannot be transferred to the eye (or the skin).

Given what we know about the speech code and the way it is normally per- 
ceived, we have reason to be pessimistic, I think, about the possibility that 
the eye or the skin can ever be a wholly adequate substitute for the ear as a 
pathway for speech sounds or even as an alternative entry to the speech decoder.
It does not follow, however, that no useful information about the speech signal 
can be transmitted through nonauditory channels. There are, as I have pointed 
out, relatively transparent or unencoded stretches of speech in which the
relation between acoustic signal and phonetic message is quite straightforward.
Since these stretches are not in need of complex decoding, they might be more 
readily "understood" when transmitted through the eye or the skin. 

At all events, I would suggest that in the design of prosthetic aids for the 
deaf we take into account what we now know (or could, by further research, learn) 
about the speech code. We should then more clearly see both the difficult prob- 
lems and the promising possibilities. 

B. Transmission of the Decoded Speech Signal 

In an alphabetically written language there is a fairly straightforward re- 
lation — a rather simple substitution cipher, indeed — between the segmented opti- 
cal shapes and the phonetic or phonemic segments they represent. We might sup- 
pose, therefore, that in presenting language to the eye of the deaf child it 
would be the better part of wisdom not to offer the raw speech signal, which re- 
quires decoding, but rather an alphabetic representation, which does not. Indeed, 
this seems the more reasonable because we know that while normal-hearing people
have not learned to read spectrograms, some have learned to read language in an 
alphabetically written form. 

But the matter is not that simple. There is abundant evidence that reading 
is a secondary linguistic activity in the sense that it is grafted onto a spoken- 
language base (Mattingly, 1972b). Thus, reading came late in the history of our 
race. Moreover, an alphabet, which represents the decoded phonetic segments, is 
the most recently invented orthography, and it is significant that it has been 
invented only once. Most relevant of all, of course, is the fact that among 
normal-hearing children many who speak and perceive speech perfectly well never- 
theless cannot learn to read. 

We should not be surprised, then, to discover that congenitally deaf chil- 
dren, having had little or no chance to master the primary spoken language, find 
it exceptionally difficult to acquire a secondary, written form of it. Indeed,
the fact that such children have more than the normal amount of trouble learning 
to read, and that they do not normally attain so high a final level of achieve- 
ment, is itself strong evidence for the essentially secondary nature of reading. 
It seems intuitively reasonable to me that a child (or anyone else) should have 
difficulty mastering the grammatical (as opposed to the lexical) elements of 
language if his initial and only exposure is to the written forms, but I don't 
know how to talk about that in any intelligent way. I will only say, therefore,
that it is surely important to us that reading is significantly harder for those
who do not speak—that it is, in effect, difficult to acquire the language by 
eye. 

How much do we know about this and what else should we try to learn? Is the 
deaf child's success in reading related to his ability to deal, by whatever means, 
with the spoken language? If so, what is the nature of the relation? Is there 
some kind of threshold effect — that is, is some certain amount of competence with 
the spoken language enough to enable the child to break through and acquire the 
rest of the language by reading? Can we discover whether experience with
particular aspects of the spoken language is more important than experience with some
others? And what does it mean, precisely, to say that a congenitally deaf child 
reads poorly? What kinds of errors does he make, for example, and how do those 
compare with the errors made by normal-hearing children? Are the deaf child's 
errors spread evenly (or randomly) over all aspects of language, or do the diffi- 
culties pattern in ways that make sense in terms of anything we know about lan- 
guage? Is there any factual support for my intuition that the deaf child might 
have more trouble with the grammatical items than with the lexical ones? Is that 
what is reflected in the comment I heard from one of the participants at this 
conference, that teachers sometimes refer to the performance of deaf children in
reading as "noun calling?" If, as I suggested earlier, the phonetic
representation normally provides an efficient vehicle for storage and processing in
short-term memory, what kinds of alternative representations are available to the deaf
child, and how well do they work for the same purpose? 

Our outline would be incomplete if we omitted another method of communicat- 
ing decoded speech to the deaf child, though in this case the decoding is not 
complete and only some aspects of speech are communicated at all. I refer to 
"lip reading." The gestures of articulation occur at a stage just prior to the 
one where much of the most severe encoding occurs. Though the gestures do not
thereby escape as many complications as my colleagues and I had once supposed, 
still they are, by contrast with the acoustic signal, more simply related to the 
phonetic message. To the extent that the deaf child can see at least some of the 
articulatory gestures, he has access to a reasonably straightforward
representation of the phonetic message. Conceivably, we will want to consider today what
we now know or ought to try to learn about lip reading. We may also want to 
wonder whether there are greater possibilities with that method than have yet 
been realised. 

II. COMMUNICATION BY AN OTHER-THAN-SPOKEN LANGUAGE

Given the problems that the deaf child has with speech, we must consider
alternative means of communication. Surely the most obvious and important of
these is sign language. Unfortunately for our purposes, and for me, I know al- 
most nothing about sign, so I will not presume to talk about it. All that I can 
do is to include it in our outline as a subject that you may want to discuss, 
and, more presumptuously, raise a few questions that my own biases lead me to 
ask. 

Seeing grammar as a kind of interface, I assumed in my introductory remarks 
that it might bear the marks of the several structures, intellect and transmission 
system, that it connects. On that basis I raised questions about the consequences 
of using a different transmission system. In sign the transmission system is very
different, involving neither the vocal tract nor the ear. I should ask, then, as
I did earlier, whether the grammar of sign is different from that of any spoken 
language, and if so, exactly how different? (Apart from its relevance to our 
understanding of the deaf, an answer to that question should be of interest to 
students of language, because it tells us something about how far up the grammat- 
ical interface the effects of the transmission system extend.) If the grammar of 
sign is very different, is there a price to be paid, either in effort or in 
efficiency, for not being able to use, as the normal-speaking person does, those 
grammatical processes that presumably evolved with language and are now a part of 
our physiology? You probably know more than I do about research on sign, includ- 
ing, for example, the work of Stokoe (1960) or that of Bellugi and Fischer (1972). 
If so, I hope you will include sign in our discussion. In any case, it is time 
for me to stop talking and, instead, to invite from you the comments that are the 
principal purpose of this meeting.

Introduction

A frequent topic of complaint from the blind and visually handicapped
concerns the long delays that often occur in receiving recordings of spoken texts.
The alleviation of these delays by use of a High Performance Reading Machine
which can provide supplementary reading services to blind people is the goal of
the research being carried out at Haskins Laboratories. From a technical
standpoint, the results of this research indicate that the automatic production
of spoken text from print is entirely feasible. Thus, a reading machine of
this kind, installed in a major library, could respond to requests from
individual blind subscribers by providing direct listening access to, or recordings
of, clearly intelligible synthetic speech from ordinary printed texts. These 
recordings can be made at rates much faster than a human speaker can produce 
them. Hence, the availability of a fast, library-based reading machine service 
could make a substantial contribution toward meeting the educational, recrea- 
tional, and vocational needs of blind people. 

Status of the Research 

A prototype reading system has been constructed at Haskins Laboratories
and has been in operation for nearly a year. Continuing efforts are being made 
to improve the performance of the machine at different levels. High on the 
list of activities during recent months have been the introduction of improve- 
ments in the quality of the speech and the incorporation of an optical charac- 
ter recognition (OCR) machine into the system to provide for the input of type- 
written texts. 

Looking ahead to the eventual deployment of a reading machine system, a 
collateral study which has gathered momentum during the past six months has 
focused attention on the intelligibility, comprehensibility, and acceptability
of synthetic speech. The data from these tests are intended to show where 
efforts on speech improvement should be concentrated, and to test the reliability
of the reading machine system. If, as expected, the results of these studies
confirm the feasibility and utility of an automated reading system from both 
technical and user standpoints, the resources might then be found to build a 
Pilot Reading Service Center. This Center would provide an experimental service 
to the blind community in its area and would act as a model on which other 
regional centers could later be based. 

Installation of the OCR Equipment 

The Cognitronics System/70 optical character recognition equipment, which 
the Laboratories purchased with funds from The Seeing Eye Foundation, was put
into operation in mid-March 1973. Since then, as a first step, an output pro- 
gram has been developed to punch a paper tape copy of the typewritten pages 
automatically scanned by the optical reader. This tape is then read by the 
DDP-224 computer which performs the remainder of the processing required to 
generate synthetic speech. As described in more detail in earlier reports, the
DDP-224 computer — using a phonetic dictionary — converts the orthographically 
spelled text received via the tape reader into phonetic text. During the con- 
version of the text into phonetic form, stress and intonation markers are intro- 
duced in readiness for speech synthesis. If all the words contained in the 
original text have been found in the dictionary and if the punctuation available 
in the original text has provided an adequate guide to the insertion of intona- 
tion marks, synthesis proceeds automatically. However, editorial intervention 
is sometimes required and provision has therefore been made, just prior to
synthesis, for an editor to check the dictionary output. New words are
continually being added to the dictionary which now contains over 150,000 entries.
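
The conversion step described above can be sketched in a few lines of Python. This is a toy illustration and not the DDP-224 program: the dictionary entries, the phonetic notation, the intonation markers, and the function name `convert` are all invented here. It shows only the control flow the text describes, namely dictionary lookup, intonation marks derived from punctuation, and a list of unknown words that would call for editorial intervention.

```python
import re

# Toy stand-in for the phonetic dictionary; entries and notation are invented.
PHONETIC_DICTIONARY = {
    "the": "DH AX",
    "machine": "M AX SH 'IY N",
    "reads": "R 'IY D Z",
    "text": "T 'EH K S T",
}

# Hypothetical intonation markers keyed by punctuation.
INTONATION_MARKS = {".": "<FALL>", "?": "<RISE>", ",": "<PAUSE>"}


def convert(text):
    """Return (phonetic_tokens, unknown_words) for an orthographic string."""
    phonetic, unknown = [], []
    for token in re.findall(r"[A-Za-z']+|[.,?]", text):
        if token in INTONATION_MARKS:
            phonetic.append(INTONATION_MARKS[token])
            continue
        entry = PHONETIC_DICTIONARY.get(token.lower())
        if entry is None:
            unknown.append(token)          # needs editorial intervention
            phonetic.append(f"[{token}]")  # placeholder left for the editor
        else:
            phonetic.append(entry)
    return phonetic, unknown


tokens, missing = convert("The machine reads the typescript.")
needs_editor = bool(missing)  # synthesis proceeds automatically only if False
```

In this sketch a sentence goes straight to synthesis only when `missing` is empty; otherwise the bracketed placeholders mark the words an editor would have to supply.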

The use of the paper tape medium to convey texts from the optical reader 
to the main computer has been adopted merely as an interim measure. Work is in
hand on the design and implementation of a direct electrical connection between 
the Cognitronics reader and the DDP-224 computer. This connection will permit 
rapid conversion of the fairly large volumes of text required for evaluation 
purposes — particularly for those requiring acceptability and tolerability judg- 
ments. One such evaluation project (for which the system is currently being 
readied) involves the regular conversion of articles from a New Haven daily 
newspaper into synthetic speech and the subsequent appraisal by blind veterans 
at the Veterans Administration Hospital in West Haven, Conn. The New Haven 
Register provides the Laboratories with punched paper tapes of an article. The 
PDP-8 computer (which is an integral part of the OCR reader) is then used to 
recode the text so that the DDP-224 computer can read it and perform the speech 
synthesis. 

Synthetic Speech Evaluation 

In the area of evaluation, recent activity has concentrated on an analysis 
of the data obtained from a closed response version of the Fairbanks Rhyme Test
and on the administration and analysis of a new test procedure using meaningless 
sentences. (The absence of meaning makes the recognition of words in continuous 
speech much more difficult.)

The Modified Rhyme Test, described in an earlier report, was administered
in synthetic speech to thirty inexperienced sighted students and six blind 
students at the University of Connecticut. As a control test, the words were 
presented in natural speech generated by a single speaker. Three hundred mono- 
syllabic words were used in six different orders of presentation. The overall 
intelligibility scores were found to be 92.5% for synthetic speech and 97.3% 
for natural speech — the former indicating needed synthesis improvements and the 
latter agreeing well with the data obtained by other workers. Initial /v/ and 
final /t/ in particular — as well as the labial, labiodental, and dental
fricatives in general — were isolated as the least intelligible phones. However, an
intrinsic limitation of the Modified Rhyme Test is that individual consonants 
are presented an unequal number of times, in unequal vowel environments, and in
an imbalanced proportion of initial versus final syllable positions. The sub- 
jects' ability to recognize words in synthetic speech was shown to improve con- 
sistently over the course of the tests; thus it is possible that the low occur- 
rence of some phones may have contributed to their low intelligibility scores. 
The finding that a listener's performance with synthetic speech improves with 
experience is consistent with the observations of many other workers. Custom- 
arily, the best scores are obtained if the "training period" with synthetic 
speech extends over several hours. However, this period is very short compared 
with the learning time demanded by nonspeech reading aids. In view of the fact 
that it is intended to be used on large volumes of reading matter, the modest 
amount of learning required in no way lessens the potential usefulness of syn- 
thetic speech in a Reading Service Center application. 

The latest test conducted in the evaluation program utilized 126 nouns, 63 
adjectives, and 63 past-tense verbs — all monosyllables selected from the 2,000 
most frequently used words in English. Words from each category were randomly
selected to create 200 meaningless sentences of the grammatical form exemplified
in this sentence: "The gold rain led the wing." These sentences were recorded 
in both naturally spoken and synthesized speech in batches of 50 sentences with 
a 10-sec interval between each sentence. During that interval the 32 sighted 
test subjects were required to write down the sentence they had heard in ordi- 
nary English orthography. Lacking semantic context cues, the test proved to be 
the most difficult yet administered. A full phoneme-by-phoneme analysis of the 
natural speech and synthetic speech errors made by each subject has been under- 
taken to discover not only the most common confusions made but also the phonetic 
environments in which the errors occurred. A large volume of data has been ob- 
tained and the concluding phase of the analysis is still in progress. Discussion 
of the results will appear in the next report. 
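
As an illustration of how sentences of the form "The gold rain led the wing" might be assembled, the Python sketch below draws words at random from three lists and places them in a fixed frame. The word lists are tiny placeholders, and the choice to use two different nouns in each sentence is an assumption made here; the report does not say how the random selection was constrained.

```python
import random


def make_sentences(nouns, adjectives, verbs, count, seed=0):
    """Build sentences of the form 'The ADJ NOUN VERB the NOUN.'"""
    rng = random.Random(seed)  # fixed seed so a run is repeatable
    sentences = []
    for _ in range(count):
        adjective = rng.choice(adjectives)
        subject, obj = rng.sample(nouns, 2)  # assumption: two different nouns
        verb = rng.choice(verbs)
        sentences.append(f"The {adjective} {subject} {verb} the {obj}.")
    return sentences


# Placeholder lists; the study used 126 nouns, 63 adjectives, and 63 verbs.
nouns = ["rain", "wing", "stone", "field"]
adjectives = ["gold", "thin", "loud"]
verbs = ["led", "held", "caught"]

sentences = make_sentences(nouns, adjectives, verbs, count=5)
for sentence in sentences:
    print(sentence)
```

Because every sentence shares the same syntactic frame and the words are chosen independently, a listener cannot use meaning to guess a word that was not clearly heard.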

Experiments in Alphabetic-to-Phonetic Conversion 

While the reading machine output is being evaluated with a view to early 
deployment, research efforts are continuing toward the improvement of the speech 
output. By deliberate choice, the current methods of assigning and modifying 
stress in the phonetic string are simple and direct. The results, however, 
while clearly superior to what might be expected if stress and intonation were
totally absent, are not entirely natural. The problem of improving the
intonation patterns in the speech output has two parts. One part involves the
observation of natural speech patterns and the determination of rules relating these
patterns to the syntactic and lexical content of the sentences. A second part 
involves the development of a flexible Experimental Synthesis Program in which 
the rules governing acceptable stress and intonation may be examined. 

Comparisons of the current synthesized output with samples of natural speech
have recently led to new experiments involving the use of increased vowel
duration as a supplemental cue for stress. In the Experimental Synthesis Program
now being written, when a syllable is chosen to be stressed its lexical vowel is 
mapped into the phonetic output string as a diphthong or occasionally as a triph- 
thong. This increase in formant excursion is applied in addition to the usual 
pitch excursion. When, on the other hand, syllables are marked for low stress, 
the vowels are in general mapped into the single vowel schwa. All syllables
moving from the phonetic dictionary to the output are additionally marked accord- 
ing to whether they occur in phrase-final position or not (a phrase in this 
sense being indicated by an intonation contour symbol). Thus, in the last
phrase before a final end-of-intonation pause, trailing resonant phones as well
as central vocalic phones are protracted. This gives a partially filled-pause 
effect which, together with the normal distinctive pitch excursion, highlights 
the conclusion of the phrase. 
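
The three rules just described (diphthongization under stress, reduction to schwa under low stress, and lengthening in phrase-final position) can be expressed as a small Python sketch. Everything specific in it is an assumption made for illustration: the phone symbols, the diphthong table, the set of resonants, the 100-msec base duration, and the 1.5 stretch factor are invented, and the pitch excursion mentioned in the text is not modeled.

```python
from dataclasses import dataclass, field

# Invented mappings and values, for illustration only.
DIPHTHONG_FOR = {"IY": "IY-AX", "EH": "EH-AX", "AA": "AA-AX"}
SCHWA = "AX"
RESONANTS = {"M", "N", "L", "R"}
FINAL_STRETCH = 1.5  # hypothetical phrase-final lengthening factor


@dataclass
class Syllable:
    onset: list
    vowel: str
    coda: list = field(default_factory=list)
    stress: str = "normal"       # "high", "low", or "normal"
    phrase_final: bool = False


def realize(syl, base_ms=100):
    """Return a list of (phone, duration_ms) pairs for one syllable."""
    if syl.stress == "high":
        vowel = DIPHTHONG_FOR.get(syl.vowel, syl.vowel)  # larger formant excursion
    elif syl.stress == "low":
        vowel = SCHWA                                    # reduced vowel
    else:
        vowel = syl.vowel
    stretch = FINAL_STRETCH if syl.phrase_final else 1.0
    phones = [(p, base_ms) for p in syl.onset]
    phones.append((vowel, base_ms * stretch))            # vocalic nucleus
    for p in syl.coda:
        # Only trailing resonants are protracted in phrase-final position.
        phones.append((p, base_ms * stretch if p in RESONANTS else base_ms))
    return phones


stressed_final = realize(Syllable(["R"], "EH", ["N"], stress="high", phrase_final=True))
unstressed = realize(Syllable(["DH"], "IY", stress="low"))
```

Keeping the rules in small tables of this kind is one way to make them easy to vary, which is the stated purpose of a flexible experimental program.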

With these methods, certain prosodic features observed in natural speech
are emphasized in the synthetic output in sharp phonetic relief. This appears 
to increase the intelligibility of sentences, although at the expense of
naturalness. Further investigation will be required to obtain a satisfactory balance
among the various cues that convey acceptable stress and phrasing within a syn- 
thesized sentence. 

Speech Synthesis 

A new OVE III cascade formant synthesizer has recently been installed as the
output stage of the reading machine system. The OVE replaces a parallel reso- 
nance synthesizer that was built at the Laboratories several years ago. While 
in many respects less flexible as a general speech research tool, the new
synthesizer has its formant filters connected in series. This arrangement is
better suited to a reading machine application since it establishes automati- 
cally the correct relative formant energy levels and reduces significantly the 
amount of calculation performed by the computer during the production of vowels.
Synthesis programs designed for the OVE have been in operation since February 
1973 and the device is already producing speech which seems better than that 
from the older model. 
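
The advantage of the series connection can be seen in a digital sketch, offered here only as an analogy, since the OVE is a hardware device and nothing below describes its circuitry. Each formant is modeled as a standard two-pole digital resonator with unity gain at zero frequency, and the source is passed through the resonators one after another. Only formant frequencies and bandwidths are supplied; no separate amplitude control is needed for each formant, which is the saving in computation the text refers to. The sample rate, formant values, and impulse-train source are illustrative assumptions.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a two-pole digital resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Filter a signal: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate):
    """Pass the source through each formant resonator in series."""
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


sample_rate = 10000
# Illustrative formant frequencies and bandwidths (Hz) for an [a]-like vowel.
formants = [(700, 60), (1200, 90), (2600, 120)]
# Voicing source approximated by one impulse every 10 msec (100-Hz pitch).
source = [1.0 if n % 100 == 0 else 0.0 for n in range(2000)]
vowel = cascade(source, formants, sample_rate)
```

In a parallel arrangement, by contrast, each resonator's output would have to be scaled by its own amplitude value before the outputs are summed, and those values would have to be computed for every vowel.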

However, one of the most striking deficiencies of the OVE synthesizer is
its limited performance on the production of nasals. The OVE has one parallel 
nasal resonator available, but spectrographic analyses of natural speech suggest 
that additional resonances and antiresonances may be needed. At present, an 
investigation is in progress to discover the extent to which the perception of 
nasality can be enhanced within the limitations of the existing hardware. A 
search is also being actively pursued to find ways in which additional
components, designed to generate the appropriate spectra, can be added. The results
of this enquiry promise to provide a substantial improvement in voice quality.

ABSTRACT

An Examination of Hemispheric Asymmetry in the Visual Processing of Linguistic 
Items* 

Claire Farley Michaels



The two cerebral hemispheres differ in their linguistic capabilities; in 
the processing of language by ear and by eye the left hemisphere is clearly 
superior to the right. Two aspects of the left-hemisphere/right-visual-field 
superiority for verbal material were examined: the earliest processing stage at 
which the superiority appears and the characteristics of verbal material that 
are relevant to that superiority. 

One approach sought to determine whether hemispheric differences were to be 
found in the latencies required for the detection of phonological lawfulness and/ 
or the detection of lexical membership. These two processes are assumed to 
underlie performance in the word/nonword classification task. In three
experiments subjects were presented lateralized letter trigrams for classification as
"word" or "nonword." No hemispheric differences were observed. The absence of
asymmetry in this task could be attributed to apparent changes in processing 
mode during the course of an experiment. In addition, the spatial (right hemi- 
sphere) nature of the manual response signaling the classification may have ob- 
scured any left-hemisphere superiority. 

Another approach to the aspects of left-hemisphere superiority took
advantage of visual masking. Dichoptic masking functions trace out a J-shaped
relation between target identifiability and the time elapsed between target and mask
onset. The descending portion of such functions can be said to describe masking
arising in an icon-construction stage while the ascending part describes masking
originating in an icon-identifying stage. A left-hemisphere superiority was
detected only in the latter case and this superiority held for both words and
nonwords. A more detailed analysis revealed that the left hemisphere responded 
more than the right to a meaning distinction (e.g., CAT vs. CAG) but less than 
the right to a number-of-syllables distinction (e.g., CAG vs. CKG). These
observations were discussed in the context of current conceptualizations of the
relation between language and brain.

INTRODUCTION



Auditory and Phonetic Levels of Processing in Speech Perception 

Current theories of speech perception suggest that the process by which a
listener understands the linguistic content of a spoken message consists of sev- 
eral distinct conceptual levels or stages. Although different levels have re- 
ceived primary emphasis from different authors, there is general agreement that 
any satisfactory account of speech perception must include at least auditory, 
phonetic, phonological, syntactic, and semantic levels (see, for example, Fry,
1956; Fant, 1967; Liberman, 1970; Stevens and House, 1972; Studdert-Kennedy, in 
press). Typically, these levels are considered to be organized in a hierarchical 
manner, with the input to one level roughly corresponding to the output of the 
previous level in the hierarchy. However, since there is ample evidence that 
constraints at one level can modify operations at both lower and higher levels 
(cf. Studdert-Kennedy, in press), an adequate description of the speech percep- 
tion process must also account for these interactions between levels. 

The present research was concerned with the auditory and phonetic levels in 
such a conceptual hierarchy; that is, those perceptual processes which intervene 
between the acoustic speech signal and a stage roughly corresponding to the
identification of phonemes. These two levels can be isolated experimentally from
syntactic and semantic levels by investigating the perception of isolated
phonetic stimuli in the form of nonsense syllables. While an intuitive distinction
between auditory and phonetic levels of processing has been evident in a number of
theories of speech perception (Fant, 1967; Liberman, Cooper, Shankweiler, and 
Studdert-Kennedy, 1967; Stevens and Halle, 1967; Stevens and House, 1972), this 
distinction has been made most explicit in the recent work of Studdert-Kennedy 
(Studdert-Kennedy and Shankweiler, 1970; Studdert-Kennedy and Hadding, 1971; 
Studdert-Kennedy, Shankweiler, and Pisoni, 1972; Studdert-Kennedy, in press).

The auditory level is characterized by its direct relation to the acoustic 
signal. This level is assumed to consist of those neural processes which analyze 
the acoustic input into a set of auditory parameters. "It is automatic, that is 
beyond voluntary control: It transforms the acoustic waveform that impinges on 
the ear to some time-varying pattern of neurological events of which the spectro- 
gram is, at present, our closest symbolic representation" (Studdert-Kennedy, in 
press). Thus, the auditory level may be characterized as that portion of the 
speech perception process that is nonlinguistic, and therefore includes those 
mechanisms that operate on speech and nonspeech signals alike. 

In contrast to the direct relationship between the auditory level and the 
acoustic signal, the phonetic level is best characterized by its abstractness. 
That is, the phonetic features which are assumed to be the output of the phonetic 
level are not inherent in the speech signal, but rather are abstract linguistic 
entities. Instead of a one-to-one mapping of sound to phoneme in the fashion of
an alphabet, the phonemes are linked to the acoustic signal by a complex set of
transformations which have been called the "speech code" (Liberman et al., 1967). 
The specific details of this code and the empirical data on which it is based
have been thoroughly reviewed elsewhere (Liberman et al., 1967; Mattingly and
Liberman, 1969; Liberman, 1970; Cooper, 1971; Liberman, Mattingly, and Turvey, 
1971), and will not be considered in detail here. However, it will be useful to 
consider briefly the most important single characteristic of the speech code:
the parallel transmission of phonetic information in the acoustic signal. 

Parallel transmission of phonetic information is accomplished in three gen- 
eral ways. 1) A single acoustic segment in the speech signal can carry informa- 
tion simultaneously about two or more successive phonetic segments. Therefore, 
there is often little correspondence between acoustic segments in the signal and 
phonetic segments in the message. 2) Information about multiple phonetic fea- 
tures of the same phonetic segment is transmitted simultaneously in time by dif- 
ferent portions of the speech signal. 3) The contrast between values of a given 
phonetic feature may be cued by a number of different acoustic parameters, both 
simultaneously and successively in time. Thus, information about perceptually 
distinct phonetic segments is completely merged and intermixed in the acoustic 
signal. This characteristic "encoding" of phonetic information in the speech
signal provides speech communication with its great efficiency, but at the same 
time requires specialized perceptual mechanisms, in addition to the auditory sys- 
tem, which are capable of decoding the phonetic content from the speech signal 
(Liberman et al., 1967). 

To summarize, the nonlinguistic auditory level is assumed to perform a pre- 
liminary analysis of the acoustic speech signal, resulting in a set of auditory 
parameters that are the neural representation of that signal. This process is 
not unique to speech but operates on any acoustic signal within the audible range. 
In contrast, mechanisms at the phonetic level are assumed to perform the linguis- 
tic decoding process by which the particular complex of acoustic cues for a given 
phonetic feature is extracted from the results of the auditory analysis. On 
logical grounds alone it has been argued that this decoding process requires 
specialized neural mechanisms in addition to the general auditory system (Liberman 
et al., 1967; Mattingly and Liberman, 1969; Liberman, 1970; Mattingly, Liberman, 
Syrdal, and Halwes, 1971), and empirical evidence to be described below is con- 
sistent with this suggestion. Thus, the auditory level is assumed to correspond 
roughly to the general auditory system of man and other primates, while the pho- 
netic decoding process is attributed to the existence in the human brain of addi- 
tional neural mechanisms specialized for speech perception. 

Differences Between the Perception of Speech and Nonspeech Stimuli 

Much of the evidence supporting a distinction between auditory and phonetic 
levels of processing has come from differences in the way speech and nonspeech 
stimuli are perceived; in particular, 1) tendencies toward "categorical" versus 
"continuous" perception (for recent reviews see Studdert-Kennedy, Liberman, 
Harris, and Cooper, 1970a; Pisoni, 1971; Mattingly et al., 1971), and 2) ear ad- 
vantages in dichotic listening (for a recent review see Studdert-Kennedy and 
Shankweiler, 1970). 

Categorical perception refers to the tendency for the discrimination of cer- 
tain speech sounds to be a function of phonetic categories rather than a function 
of physical differences between stimuli. Listeners can easily discriminate be- 
tween two stimuli selected from different phonetic categories, but they cannot 
discriminate between two stimuli selected from the same phonetic category even 
though the physical difference between the two stimuli is equal in both cases. 
In other words, for these stimuli, differential discrimination is limited by
absolute identification. In contrast, differential discrimination among levels of
a nonspeech dimension such as pitch is far better than absolute identification, 
with well over 1,000 different values discriminated differentially, and absolute 
identification limited to the familiar seven, plus or minus two. Thus, while the 
discrimination among levels of a nonspeech dimension is related to the physical 
stimulus dimension, the discrimination among certain speech stimuli is limited to 
a large extent by the linguistic knowledge of the listener. 

A second important difference between the perception of speech and nonspeech
stimuli is the nature of ear advantages in dichotic listening. The basic finding 
in such experiments is that when pairs of speech stimuli are presented simultaneously
to the two ears (i.e., dichotically), those stimuli presented to the right
ear are more accurately identified than those presented to the left ear: a
"right-ear advantage." In contrast, for nonspeech stimuli the typical result has
been either no ear advantage or an advantage in favor of the left ear. Kimura
(1961a), who originally reported the right-ear advantage for dichotically presented
digits, attributed it to two factors: 1) the prepotency of the auditory
pathways from each ear to contralateral auditory cortex demonstrated by
neurophysiological experiments; and 2) the predominant lateralization of language
function in the left hemisphere demonstrated by the analysis of language disorders
following damage to the left and right hemispheres (Milner, 1967;
Geschwind, 1970). If this interpretation were correct, then subjects with known
language dominance of the right hemisphere should show a left-ear advantage for
speech in a comparable experiment. This result was indeed obtained (Kimura,
1961b). Although the initial dichotic listening experiments employed digits and
other meaningful verbal stimuli, subsequent experiments have clearly shown that 
the right-ear advantage is not dependent upon higher level syntactic or semantic 
processes. Some of the largest and most reliable right-ear advantages have been 
obtained using isolated consonant-vowel syllables (Studdert-Kennedy and 
Shankweiler, 1970). A number of alternative explanations to the Kimura model of
the right-ear advantage have been suggested, including response bias, attentional
bias, and order of report effects. However, the right-ear advantage is still
obtained in paradigms that eliminate these alternative explanations.

Differences Between the Perception of Auditory and Phonetic Dimensions of the 
Same Speech Stimuli 

Another approach to the distinction between auditory and phonetic levels of 
processing has been recently reported by Day and Wood (1972a) and Wood, Goff, and
Day (1971). Instead of concentrating upon differences between the perception of
speech and nonspeech stimuli, these experiments compared the perception of audi- 
tory and phonetic dimensions of the same speech stimuli. Thus, while previous 
experiments compared auditory and phonetic processing by varying the nature of 
the acoustic stimuli, the experiments of Day and Wood (1972a) and Wood et al. 
(1971) presented a single set of speech stimuli and varied the nature of the
processing task. Since these initial experiments form the immediate background for
the present research, they will be described in some detail. 

The experimental paradigm used by Day and Wood (1972a) was a two-choice 
speeded-classification task similar to that employed by Garner and Felfoldy (1970)
to study interactions between stimulus dimensions in information processing.
Subjects were presented series of synthetic consonant-vowel (CV) syllables that
varied between two levels on a given target dimension, and were required to
identify which level of the target dimension occurred on each trial. The two
stimulus dimensions compared in this experiment were: 1) a phonetic dimension,
place of articulation of voiced stop consonants; and 2) an auditory dimension,
fundamental frequency or pitch. For convenience, these two dimensions will be
referred to as Place and Pitch, respectively. Reaction time (RT) for the
identification of each dimension was measured under two conditions: 1) a
single-dimension control condition, in which only the target dimension varied in the stimulus
sequence; and 2) a two-dimension orthogonal condition, in which both the target 
dimension and the irrelevant nontarget dimension varied orthogonally in the stim- 
ulus sequence. For each dimension the only difference between the control and 
orthogonal conditions was the presence or absence of irrelevant variation in the 
nontarget dimension. Therefore, a comparison of the RTs from these two conditions
indicated the degree to which each dimension was processed independently of
irrelevant variation in the other dimension. 

Three possible interactions between dimensions could be obtained in this 
paradigm. 1) Irrelevant variation in each dimension could produce interference 
with the processing of the other. This result is typical of that obtained for
"integral" stimulus dimensions by Garner and Felfoldy (1970). Such a result
would suggest that the two dimensions were automatically extracted by a single 
perceptual process, or by multiple processes which are strongly dependent upon 
each other. 2) A second possibility is that irrelevant variation in neither 
dimension could interfere with the processing of the other. This result is 
typical of that for "nonintegral" dimensions in the experiment of Garner and 
Felfoldy (1970). Such a result would suggest that the perceptual processes for
the two dimensions are largely independent. 3) The final possibility is that the
interference between dimensions could be unidirectional; that is, irrelevant
variation in one dimension could interfere with the processing of the other, but 
not the reverse. This result would also imply that multiple processes are in- 
volved in the extraction of the two dimensions. However, the unidirectional 
interference would imply a dependence of the processes for one dimension upon 
those for the other. 
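
The three hypothetical outcomes can be restated in terms of an interference score for each dimension: mean RT in the orthogonal condition minus mean RT in the control condition. The Python sketch below is illustrative only. The function name, the fixed 10 msec criterion, and the idea of using a threshold at all are assumptions introduced here; the experiments themselves evaluate these differences statistically rather than against a fixed cutoff.

```python
def classify_interference(rt_a_control, rt_a_orthogonal,
                          rt_b_control, rt_b_orthogonal,
                          criterion_msec=10.0):
    """Label the interference pattern between two dimensions, A and B.

    All arguments are mean RTs in msec. criterion_msec is a hypothetical
    cutoff standing in for a proper significance test.
    """
    # Interference = cost of irrelevant variation in the other dimension.
    interference_a = rt_a_orthogonal - rt_a_control
    interference_b = rt_b_orthogonal - rt_b_control

    a_slowed = interference_a > criterion_msec
    b_slowed = interference_b > criterion_msec

    if a_slowed and b_slowed:
        return "mutual interference (integral)"
    if not a_slowed and not b_slowed:
        return "no interference (nonintegral)"
    return "unidirectional interference"
```

With made-up means in which only dimension A slows down in the orthogonal condition, for example `classify_interference(400, 450, 400, 403)`, the function reports the third, unidirectional pattern.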

The results of Day and Wood (1972a) for the Place and Pitch dimensions 
followed the third hypothetical pattern described above. For identification of
Place, there was a substantial increase in RT from the control to the orthogonal 
condition, indicating that irrelevant variation in Pitch significantly interfered 
with the processing of Place. In contrast, there was only a slight increase in 
RT for Pitch, indicating that subjects could ignore or filter the Place dimension 
to a considerable degree when required to process Pitch. These results suggest
that different levels of processing underlie the identification of auditory and 
phonetic dimensions of the same speech stimuli. In addition, they suggest that 
the phonetic level processes are in some way dependent upon those performed by 
the auditory level. 

Neurophysiological evidence for a distinction between auditory and phonetic 
levels of processing was obtained in a related experiment by Wood et al. (1971).
This experiment used the same basic strategy as the RT experiment of Day and 
Wood (1972a), by comparing neural activity during the identification of auditory 
and phonetic dimensions of the same speech signal. Averaged evoked potentials 
produced by the same synthetic CV syllable were recorded over the left and right 
hemispheres during Place and Pitch identification tasks similar to those of the 
control condition in the experiment of Day and Wood (1972a). In this way, neural 
activity that occurred during the Place and Pitch tasks could be compared
directly, without differences between tasks in the acoustic speech signal or its 
presentation probability, in the subjects' motor responses, in RT, or in the record- 
ing apparatus. These controls were necessary to eliminate the possibility that 
obtained evoked potential differences between tasks were produced by factors 
other than the perceptual processes required for each task. 

If the processing of auditory and phonetic dimensions of a speech signal 
were accomplished by a single perceptual process, then evoked potentials during 
the Place and Pitch tasks would merely be random samples from a single population 
and should differ only by sampling fluctuations. In the time interval between 
the onset of the speech signal and subjects' identification responses, evoked 
potentials at locations over the right hemisphere were indeed identical for both 
tasks. However, significant differences in evoked potentials were obtained at 
left-hemisphere locations during the same time interval. These results indicate 
that different neural events occur in the left hemisphere during the identifica- 
tion of auditory and phonetic dimensions of the same acoustic signal. 

Thus, the neurophysiological data of Wood et al. (1971) provide additional 
support for the distinction between auditory and phonetic levels of processing 
demonstrated by the RT experiment of Day and Wood (1972a). Both experiments 
suggest that the identification of Place involves an additional level of process- 
ing that is not required for the identification of the Pitch of the same speech 
signal. In addition, the evoked potential data relate this additional level of 
processing to the concept of hemispheric specialization for speech perception, 
derived from the analysis of language disorders following brain damage and the 
results of dichotic listening experiments. 

Rationale for the Present Experiments 

Together with the categorical perception and dichotic listening experiments
described above, the experiments of Day and Wood (1972a) and Wood et al. (1971) 
provide a strong set of converging operations (Garner, Hake, and Eriksen, 1956) 
upon the distinction between auditory and phonetic levels of processing in speech 
perception, and upon the idea that the phonetic level processes are performed by 
specialized neural mechanisms which are lateralized in one cerebral hemisphere.
The present investigation was a direct extension of the experiments of Wood et al. 
(1971) and Day and Wood (1972a) (hereafter called the initial experiments) and 
had three main purposes: 

a) To specify in greater detail the nature of the acoustic stim- 
uli and processing tasks responsible for the neurophysiological and 
RT results of the initial experiments. Such information would pro- 
vide further substantiation for the basic distinction between audi- 
tory and phonetic levels, allow a more detailed understanding of the 
specific functional operations performed by each level, and clarify 
the nature of interactions between levels. 

b) To make a stronger test of the convergence of the RT and 
neurophysiological findings upon the distinction between auditory and 
phonetic processing. Although it was suggested above that both sets 
of results reflect the same underlying difference between auditory 
and phonetic levels of processing, the two initial experiments did 
not use identical paradigms or stimulus sets. These differences reflect
methodological constraints imposed by the distinct backgrounds
from which each experiment was derived. If both experimental opera- 
tions actually do converge upon the single concept of specialized 
phonetic mechanisms distinct from the general auditory system, then 
it should be possible to obtain both the neurophysiological and RT 
findings in a single experiment. In addition, both response measures 
should be entirely consistent over a range of acoustic stimuli and
identification tasks which differ in the aspect of whether or not 
phonetic processing is required. 

c) To obtain more information concerning the characteristics of 
the neural activity that occurs during phonetic processing. Although 
Wood et al. (1971) clearly demonstrated that different neural events 
occur during phonetic and auditory processing, that experiment pro- 
vided relatively little information concerning the characteristics of 
those events or their relation to neurophysiological correlates of 
other perceptual phenomena. 

The present investigation consisted of four experiments, each of which com- 
pared the perception of two dimensions of the same speech stimuli. In all four 
experiments a two-choice identification paradigm similar to that of Day and Wood 
(1972a) combined the methodology of the RT and evoked potential experiments so 
that both response measures could be obtained in each experiment. One stimulus 
dimension, Pitch, was identical in all experiments, in order to provide the same
auditory processing task as a common baseline in all four experiments. Indiv- 
idual experiments differed from the others only in the second stimulus dimension 
compared to Pitch. The nature of the information carried by this second dimen- 
sion and its status as linguistic or nonlinguistic constituted the principal man- 
ipulation across experiments. 

Experiments 1 and 2 sought to provide further validation of the initial RT 
and evoked potential experiments which distinguished between auditory and phonet- 
ic levels of processing. Experiment 1 was a replication of the comparison be- 
tween the Place and Pitch dimensions made in the initial experiments, while 
Experiment 2 was a control experiment comparing Pitch with another auditory
dimension, Intensity. Experiments 3 and 4 were designed to provide more specific
information about the acoustic stimuli and processing tasks responsible for the 
differences between phonetic and auditory dimensions. Both experiments investi- 
gated dimensions with different degrees of approximation to the phonetic dimen- 
sion Place used in Experiment 1 and the initial experiments. Experiment 3 ana- 
lyzed the acoustic cue for the Place distinction, the second formant transition, 
in isolation rather than in phonetic context as in Experiment 1. Experiment 4 
analyzed Pitch Contour, which is basically an auditory dimension, but which 
under the appropriate conditions can cue a linguistic distinction. 
METHOD



The Basic Paradigm Common to All Experiments 

Subjects were presented series of trials consisting of synthetic CV sylla- 
bles which varied along a given target dimension. On each trial subjects were 
required to identify which of two possible levels on the target dimension had 
occurred by pressing one of two response buttons as rapidly as possible.
Subjects' identification accuracy, identification speed (RT), and averaged evoked
potentials time-locked to the acoustic stimulus were recorded on each trial.
Identification of the two dimensions was measured under two conditions: 1) a
single-dimension control condition, in which only the target dimension varied in
the stimulus sequence; and 2) a two-dimension orthogonal condition in which both 
the target dimension and the irrelevant nontarget dimension varied orthogonally 
in the stimulus sequence. 

TABLE 1: General form of the 2x2 paradigm used in each experiment.

A general form of this 2x2 paradigm is shown in Table 1 for the hypothetical
stimulus dimensions A and B. Each dimension has two levels, 1 and 2, and
the stimuli for each experiment consist of all possible combinations of the two
levels of each dimension (i.e., A1B1, A1B2, A2B1, and A2B2). The set of possible
stimuli for each cell in the paradigm was selected from this complete set of
four. For example, consider the case in which dimension A is the target dimension
and dimension B is the nontarget dimension shown in the top half of Table 1.
In the orthogonal condition shown on the right, both dimensions vary orthogonally 
and the stimulus set for this cell consists of all four stimuli. In the control 
condition shown on the left, only the target dimension A varies and the nontarget 
dimension B is held constant at one of its two levels. Notice that there are two
ways in which the nontarget dimension can be held constant: either at level 1
shown in the upper half of this cell, or at level 2 shown in the lower half. In
order to allow each of the four stimuli to occur an equal number of times in each 
cell, two separate blocks of trials were presented for each cell. In the control 
condition the stimulus sets for these two blocks are shown in the upper and lower 
halves of the control cells in Table 1. In the orthogonal condition there is 
only one possible stimulus set, and the two blocks of trials for this condition 
were repetitions of this same set. The lower half of Table 1 shows the corres- 
ponding stimulus sets for blocks of trials in which dimension B is the target 
dimension. By constructing the stimulus sets in this manner, it was possible for 
each of the four stimuli to occur an equal number of times both within and be- 
tween cells. 
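
The construction of the stimulus sets can be sketched in a few lines of Python. This is an illustration under assumptions, not a reconstruction of the original stimulus lists: the function name, the string labels such as "A1B2", and the dictionary layout are hypothetical conveniences. For a given target dimension it returns the two control blocks (nontarget dimension fixed at level 1, then at level 2) and the two identical orthogonal blocks.

```python
from itertools import product

LEVELS = (1, 2)


def stimulus_sets(target):
    """Return the stimulus sets for both cells of one target dimension.

    target is "A" or "B". Labels like "A1B2" mean level 1 on dimension A
    combined with level 2 on dimension B.
    """
    if target not in ("A", "B"):
        raise ValueError("target must be 'A' or 'B'")

    # The complete set of four stimuli: every combination of levels.
    all_four = [f"A{a}B{b}" for a, b in product(LEVELS, LEVELS)]
    nontarget = "B" if target == "A" else "A"

    # Control condition: one block per fixed level of the nontarget dimension.
    control_blocks = [
        [s for s in all_four if f"{nontarget}{fixed}" in s]
        for fixed in LEVELS
    ]

    # Orthogonal condition: both blocks draw on all four stimuli.
    return {"control": control_blocks,
            "orthogonal": [list(all_four), list(all_four)]}
```

Taken over the two control blocks, each of the four stimuli appears in exactly one stimulus set, which is what allows equal stimulus frequencies within and between cells once the blocks are the same length.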

The use of the design shown in Table 1 allowed both the RT and the evoked 
potential data to be obtained in each experiment. As described above, the degree 
to which each dimension could be processed independently of irrelevant variation 
in the other was analyzed by comparing RTs in the control and orthogonal condi- 
tions. For the RT data, this paradigm was virtually identical to that of Day and 
Wood (1972a). The analysis of neural activity comparable to that of Wood et al. 
(1971) was made by comparing evoked potentials recorded during the control condi- 
tion for each dimension. As described in detail above, this comparison requires 
that the evoked potentials for each dimension be equated for factors such as the 
acoustic stimuli, their presentation probability, subject's motor response, RT,
electrode locations, and the recording apparatus. Only under these circumstances
may obtained differences in evoked potentials be attributed to differences in the 
perceptual processes for each dimension. 

Subjects 

Six male and six female paid volunteers aged 19-24 served as subjects in all
four experiments. All subjects were right handed and had no history of hearing
difficulty. 

Stimuli 

The four acoustic stimuli used in each experiment were generated by the 
Haskins Laboratories' parallel resonance synthesizer and edited under the
Haskins' executive system (Mattingly, 1968). In Experiments 1, 2, and 4, the
synthetic stimuli were two-formant CV syllables of 300 msec duration. In 
Experiment 3, the stimuli were portions of the full CV syllables, the second formant
transitions. The specific set of stimuli for each experiment will be described
below. The four stimuli for each experiment were precisely equated for
all acoustic parameters except for the two dimensions that were explicitly varied. 

From the synthesizer each stimulus was digitized and stored on a magnetic 
disc via the Haskins Laboratories' pulse-code modulation system (Cooper and
Mattingly, 1969). A five-channel stimulus tape was prepared for each experiment, 
consisting of four stimulus channels and one channel of trigger pulses. A series 
of 64 trigger pulses was first recorded on magnetic tape at 5-sec intertrial
intervals by a Precision Instrument FM tape recorder (frequency response:
± 0.5 db, DC to 10 kHz at 30 ips). One of the synthetic stimuli was then played
into the analog-to-digital converter of a PDP-12 computer, digitized at a sampling
interval of 60 μsec, and stored in core memory. The prerecorded trigger pulses
were used to trigger the computer, causing the stimulus in memory to be played
through the digital-to-analog converter and recorded on another channel of the
analog tape. This process was repeated for the remaining three stimuli for that 
experiment, resulting in a stimulus tape with the four stimuli and the trigger 
pulses occurring 64 times in parallel on separate channels all synchronized to 
60 μsec accuracy. Any of the four possible stimuli could therefore be presented
on a given trial by connecting the appropriate channel of the tape recorder to 
the subject's earphones. 

Apparatus 

Subjects were seated comfortably in an Industrial Acoustics Corporation 
sound-attenuating and electrically shielded chamber which was illuminated at mod- 
erate intensity. The stimuli were presented binaurally from the tape recorder to 
G. C. Electronics earplug-type earphones through a Grason-Stadler Model 829D
electronic switch at 65 db SL against a constant 30 db SL masking noise (Grason- 
Stadler Model 701 noise generator). Both of these values were determined individ- 
ually for each subject prior to the experiment. Intensities of the synthetic 
stimuli and white noise were separately controlled by Hewlett Packard Model 350B 
two-channel decade attenuators in series with the tape recorder and white noise 
generator. 

The electroencephalogram (EEG) was recorded using a Grass Model 7 polygraph 
with Grass Model P511 wide-band AC EEG preamplifiers (system gain: 2 x 10^) and
was monitored visually throughout each block of trials. Half-amplitude low- and 
high-frequency settings of the amplifiers were 0.1 Hz and 300 Hz, respectively. 
Scalp recordings were made using Grass silver disc electrodes from two symmetrically
placed locations over each hemisphere, each referred to a linked ear reference.
The scalp locations were T3 and C3 over the left hemisphere and T4 and
C4 over the right hemisphere according to the International 10-20 system (Jasper, 
1958). Subjects were grounded through a plate attached to the left wrist. The 
impedance of all electrodes was monitored carefully throughout each recording 
session and maintained at less than 4.0 kohms at 10 Hz. Particular care was 
taken to equalize impedances of the two ear reference electrodes. 

The amplified EEG from the four scalp locations (T3, C3, C4, T4) was entered 
into a LINC computer for on-line analog-to-digital conversion and signal averaging.
The LINC sampling epochs were 490 msec long with 256 time points per sampling
epoch. Three sampling rates were used in the 490 msec epoch: 1 point
every 0.5 msec for the first 60 points, 1 point every 1 msec for the next 66 
points, and 1 point every 3 msec for the remaining 130 points. The LINC stored 
the resulting averaged evoked potentials on digital magnetic tape separately for 
each stimulus in each block of trials. In addition to this on-line processing, 
all channels of EEG together with pulses synchronized with stimulus onset and 
the subjects' identification responses were recorded on a Honeywell Model 8100 FM
tape recorder (frequency response: ± 0.25 db, DC to 625 Hz at 3-3/4 ips) for 
subsequent off-line data analysis. 
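
Because the epoch is sampled at three different rates, the 256 points are not evenly spaced in time, and any later plotting or measurement of the averaged evoked potentials needs the corresponding time axis. The Python sketch below rebuilds that axis from the figures given above. It assumes, hypothetically, that the first point falls at stimulus onset (0 msec) and that each point marks the start of its sampling interval; the function name is likewise an illustration.

```python
def linc_time_axis():
    """Return the latency (msec after onset) of each of the 256 points."""
    # (number of points, msec per point) for the three sampling rates
    segments = [(60, 0.5), (66, 1.0), (130, 3.0)]

    times = []
    t = 0.0  # assumed: first sample coincides with stimulus onset
    for n_points, step_msec in segments:
        for _ in range(n_points):
            times.append(t)
            t += step_msec
    return times


times = linc_time_axis()
epoch_msec = times[-1] + 3.0  # end of the last 3 msec interval
```

Under these assumptions the three segments cover 30, 66, and 390 msec, so the 256 points span 486 msec, in line with the nominal 490 msec epoch.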
The LINC also controlled stimulus presentation order and recorded the subject's
reaction time. On each trial the LINC read a stimulus code from paper
tape and closed a relay to present the stimulus specified by that code to the 
subject's earphones. Separate paper tapes were made for each cell in the 2x2 
paradigm shown in Table 1, each tape containing a different constrained random 
order of the possible stimuli for that cell. The random orders were determined 
by reference to a random number table with the constraint that each possible 
stimulus for a given block of 64 trials occur an equal number of times in that 
block, with no runs of the same stimulus longer than five. No subject ever received
the same random order in any cell across all four experiments. Reaction
time was recorded to 1 msec accuracy using a counter-timer (Beckman-Berkeley
Model 7531R) which was triggered at stimulus onset and halted by the subject's
button-press response. The LINC then read and translated the RT from the
counter-timer and punched the stimulus code and RT for that trial on paper tape 
for later analysis. 
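
The two constraints on the stimulus orders (equal frequency of each possible stimulus within a block of 64 trials, and no run of the same stimulus longer than five) can be illustrated with a short Python sketch. The original orders were drawn from a random number table; the version below substitutes a pseudo-random shuffle with rejection of any order that violates the run constraint. The function names and the `seed` parameter are assumptions made for the illustration.

```python
import random


def longest_run(sequence):
    """Length of the longest run of identical consecutive items."""
    longest = 0
    current = 0
    previous = object()  # sentinel that equals no stimulus label
    for item in sequence:
        current = current + 1 if item == previous else 1
        longest = max(longest, current)
        previous = item
    return longest


def constrained_order(stimuli, n_trials=64, max_run=5, seed=None):
    """Random order with equal stimulus counts and bounded run length."""
    if n_trials % len(stimuli) != 0:
        raise ValueError("n_trials must be a multiple of the stimulus count")

    rng = random.Random(seed)
    # Equal frequency is guaranteed by construction; only the run
    # constraint has to be checked after each shuffle.
    order = list(stimuli) * (n_trials // len(stimuli))
    while True:
        rng.shuffle(order)
        if longest_run(order) <= max_run:
            return order
```

A control block would be generated with two stimuli, e.g. `constrained_order(["A1B1", "A2B1"])`, and an orthogonal block with all four.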

Procedure 

Before beginning the experiments, each subject served in an initial orienta- 
tion session to become familiar with the apparatus and experimental procedures. 
The methods of EEG and evoked potential recording were explained in some detail 
and subjects received practice in all conditions of the first experiment they 
were scheduled to receive. The practice trials were presented under conditions 
identical to the actual experiment and allowed subjects to stabilize performance 
in the identification tasks. In addition, these practice trials allowed the ex- 
perimenter to assess subjects for two criteria that each subject had to meet in 
order to participate in the actual experiments: 1) they had to perform the iden- 
tification tasks accurately with mean RTs of less than 600 msec in the control 
conditions; and 2) they had to perform the tasks so that stable EEG and evoked 
potentials could be recorded without muscle or movement artifact. Two potential 
subjects out of the initial 14 tested did not meet these criteria and were ex- 
cused. One subject failed to meet the RT criterion, while the other showed a 
large movement artifact which could not be eliminated. The remaining 12 subjects 
met both criteria and were continued for the remainder of the experimental pro- 
cedure. 

Since each subject participated in all four experiments, they received the 
experiments in an order specified by a balanced latin square to control for pos- 
sible effects of presentation order. For this purpose the 12 subjects were 
placed in four groups of three, with a different order of the four experiments 
specified for each group by the latin square. Each experiment was given in a 
separate session, with at least two days intervening between sessions. To con- 
trol for possible effects of presentation order within a given experiment, the 
four cells in the 2x2 paradigm were also presented in an order specified by a 
balanced latin square. Each subgroup of three subjects was assigned a within- 
experiment presentation order according to the latin square and received this 
same order in all four experiments. 
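
For readers unfamiliar with balanced latin squares, the following Python sketch builds one for an even number of conditions. The particular square used in the experiments is not given in the text, so the construction shown (a standard one in which each later row adds 1, modulo n, to the first row) and the function name are assumptions for illustration only. In the resulting square every condition occupies every ordinal position once, and every condition is immediately preceded by every other condition exactly once.

```python
def balanced_latin_square(n):
    """Return an n x n balanced latin square (n even) as a list of rows.

    Conditions are numbered 0 .. n-1; each row is one presentation order.
    """
    if n % 2 != 0:
        raise ValueError("this construction requires an even n")

    # First row alternates low and high values: 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    for position in range(1, n):
        if position % 2 == 1:
            first.append(low)
            low += 1
        else:
            first.append(high)
            high -= 1

    # Each later row shifts every entry of the first row by 1 (mod n).
    return [[(value + row) % n for value in first] for row in range(n)]


# Four orders, one for each group of three subjects (hypothetical labels).
orders = balanced_latin_square(4)
groups = {f"group {g + 1}": orders[g] for g in range(4)}
```

The same square could serve both purposes described above: the order of the four experiments across sessions, and the order of the four cells within a session.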

A session consisted of eight blocks of 64 trials (two blocks of trials 
for each cell of the paradigm shown in Table 1). The first four blocks of
trials in a given session were presented in the order specified by the latin square,
and this order was then reversed for the remaining four blocks in that session.
For the orthogonal condition the two blocks of trials for each cell were identi- 
cal as described above, since all four possible stimuli occurred in random order. 
For the control condition the nontarget dimension was held constant at level 1
on one block of trials and at level 2 on another block of trials, as shown in 
Table 1. For half the subjects the nontarget dimension was held constant at 
level 1 during the first block of trials and at level 2 during the second block 
of trials, while the other half of the subjects received the levels of the non- 
target dimension in the reverse order. Thus, each subject received two blocks 
of 64 trials in each cell of each experiment, with order of presentation deter- 
mined both within and between experiments by a latin square. 

At the beginning of each session, the four stimuli for that experiment were 
presented in a fixed sequence and repeated until the subject reported that they 
could be easily distinguished. This rarely required more than two or three repe- 
titions of the sequence for any subject. Response buttons were then assigned to 
levels on each dimension in the following manner. One dimension was held con- 
stant and the two levels on the other dimension were presented in alternating 
order, beginning with the level for button 1. When the subject reported that he 
knew the correct button for each level, the stimuli were presented in random 
order until a criterion of eight consecutive correct responses was attained. 
This procedure was then repeated to assign responses to the levels on the second 
dimension. 

After mastering the button assignments, subjects received one practice block 
of 64 trials for each cell, in the same order as they would be received in the 
actual experiment. In addition, each block of trials in the actual experiment 
was preceded by at least eight practice trials to allow subjects to adapt to the 
target dimension and stimulus set they would receive on that block. Such exten- 
sive practice was designed to maintain performance at a stable optimal level 
during the experiment, while minimizing the possibility of artifact in the elec- 
trical recordings. Between each block of trials subjects received a four to five 
minute rest period, with a longer rest interval following the fourth block of 
trials in each session. An entire session, including electrode application, 
lasted three to four hours. 

Data Analysis 

As described above, the RT on each trial was recorded by the counter-timer, 
translated by the LINC, and punched on paper tape together with the code for the 
stimulus presented on that trial. These tapes were read into a PDP-12 computer 
which sorted the RTs into appropriate categories as a function of subject, dimen- 
sion, and condition. These data were then transferred to digital magnetic tape 
for permanent storage. For statistical analysis a complete four-way factorial 
analysis of variance was computed on the RTs from each experiment (Subjects x 
Conditions x Dimensions x Within). The RTs were untransformed except that RTs 
greater than 1 sec were set equal to 1 sec. This procedure eliminated the few 
very long RTs (less than 3 percent) resulting from failure to make electrical 
contact with the response button, etc. Subsequent individual comparisons between 
main effect and interaction means were made using the Scheffe procedure (Scheffe, 
1960; Winer, 1962). Unless otherwise noted, all statements of statistical 
significance are at the P < .001 level. 
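
The only adjustment made to the RTs, setting values above 1 sec equal to 1 sec, amounts to clipping at a ceiling. A minimal sketch, assuming RTs are held in msec in a plain list (the function names are illustrative):

```python
def clip_rts(rts_msec, ceiling_msec=1000.0):
    """Set every RT above the ceiling equal to the ceiling; leave others unchanged."""
    return [min(rt, ceiling_msec) for rt in rts_msec]


def fraction_clipped(rts_msec, ceiling_msec=1000.0):
    """Proportion of trials affected by the ceiling."""
    if not rts_msec:
        return 0.0
    return sum(rt > ceiling_msec for rt in rts_msec) / len(rts_msec)


rts = [412.0, 388.5, 1730.0, 455.0]
print(clip_rts(rts), fraction_clipped(rts))
```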

The evoked potentials collected on-line by the LINC during the control con- 
dition in each experiment were averaged separately for each dimension. This pro- 
cedure resulted in an evoked potential for each dimension based on 128 trials at 
each electrode location for each subject. The within-subject data were averaged 
across subjects to yield a single average for each dimension consisting of 1,536 
trials over the 12 subjects. These across-subject averages correspond directly 
to the data presented by Wood et al. (1971) and constitute the principal neuro- 
physiological data for each experiment. The statistical significance of differ- 
ences between evoked potentials in each experiment was evaluated by computing 
Wilcoxon matched-pairs signed-ranks tests (Siegel, 1956) at each of the 256 time 
points in a pair of evoked potentials at a given electrode location. This pro- 
cedure was used by Wood et al. (1971) not only to determine the statistical reli- 
ability of differences between evoked potentials for the two dimensions, but also 
to determine the precise distribution of significant differences in time relative 
to stimulus onset and subjects' identification responses. 
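
The point-by-point comparison described above can be sketched as follows. This is a minimal illustration, not the original LINC or PDP-12 program: it assumes an exact two-sided signed-ranks test computed by enumerating sign patterns (rather than the tabled critical values in Siegel, 1956), and the function names and data layout (one list of samples per subject) are hypothetical.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_exact_p(x, y):
    """Two-sided exact P for the matched-pairs signed-ranks test (zero differences dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_obs = min(w_plus, total - w_plus)
    hits = 0
    # Under the null hypothesis every assignment of signs is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= t_obs:
            hits += 1
    return hits / 2 ** len(ranks)


def significant_points(ep_a, ep_b, alpha=0.01):
    """ep_a[s][t] and ep_b[s][t]: subject s, time point t. One flag per time point."""
    flags = []
    for t in range(len(ep_a[0])):
        a = [subject[t] for subject in ep_a]
        b = [subject[t] for subject in ep_b]
        flags.append(wilcoxon_exact_p(a, b) < alpha)
    return flags
```

With 12 subjects the enumeration covers 4,096 sign patterns per time point, so running it over 256 time points is inexpensive.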



Experiment 1 was designed to replicate the initial RT and evoked potential 
experiments of Day and Wood (1972a) and Wood et al. (1971) in a single experi- 
ment. The same phonetic and auditory dimensions used in the initial experiments 
were again used in Experiment 1: place of articulation of voiced stop consonants 
(Place) and fundamental frequency (Pitch). Place was selected as the phonetic 
dimension since this cue is an excellent example of the "encoding" of phonetic 
information in the speech signal (Liberman et al., 1967). In addition, in 
categorical perception and dichotic listening experiments, stop consonants have 
consistently produced results which have been interpreted as characteristic of 
phonetic perception (Liberman et al., 1967; Studdert-Kennedy et al., 1970a; 
Studdert-Kennedy and Shankweiler, 1970; Pisoni, 1971). Pitch was selected as the 
auditory dimension since the absolute fundamental frequency of a syllable conveys 
no phonetic information in English. In addition to replicating the initial 
experiments, Experiment 1 assessed their generality by using two-formant instead of 
three-formant synthetic syllables, by using acoustic cues for a different value 
on the Place dimension (/b/ and /g/ instead of /b/ and /d/), and by using formant 
frequencies for a different vowel (/ae/ instead of /a/) than were employed in the 
initial experiments. 

Stimuli 



The stimuli for Experiment 1 consisted of the four possible combinations of 
two levels on the Place (/bae/ and /gae/) and Pitch dimensions (104 Hz and 
140 Hz). Wide- and narrow-band spectrograms of these four stimuli are shown in 
Figures 1 and 2, respectively. The spectrograms are three-dimensional displays 
in which intensity (relative darkness of the display) is plotted as a function 
of frequency (vertical axis) and time (horizontal axis). The two horizontal 
bands of highest energy in each stimulus are called formants, and are numbered 
from low to high frequency (i.e., F1, F2, etc.). The rapid frequency changes at 
the beginning of each formant are called formant transitions. 

Pairs of stimuli which differed on the Place dimension differed only in the 
direction and extent of the F2 transition, as shown by the two spectrograms on 
the left versus the two on the right of Figures 1 and 2. All acoustic parameters 
of these pairs of stimuli were identical except for the initial 45 msec of F2. 
The direction and extent of the F2 transition is the acoustic cue important for 
distinguishing among voiced stop consonants (Liberman, Delattre, Cooper, and 
Gerstman, 1954; Delattre, Liberman, and Cooper, 1955). In the context of the 
vowel /ae/, a rising F2 transition is the cue for /b/ and a sharply falling F2 
transition is the cue for /g/. 

Figure 1: Wide-band spectrograms of the four synthetic stimuli for Experiment 1. 

Figure 2: Narrow-band spectrograms of the same stimuli shown in Figure 1. 

Pairs of stimuli which differed on the Pitch dimension (top versus bottom 
halves of Figures 1 and 2) had identical formant patterns and differed only in 
fundamental frequency (F0). The differences in F0 are shown indirectly in both 
Figures 1 and 2, but may be seen more clearly in the narrow-band spectrograms of 
Figure 2. In these spectrograms the bandwidth of the analysis filters was 
sufficiently narrow to resolve the individual harmonics of the F0 of each stimulus. 
The harmonics of the 104 Hz stimuli (upper two spectrograms in Figure 2) are more 
closely spaced, reflecting the closer spacing of integral multiples of 104 Hz 
than 140 Hz. 
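
The closer harmonic spacing of the 104 Hz stimuli follows directly from the fact that harmonics fall at integral multiples of the fundamental. A small sketch, in which the 1000 Hz upper limit is an arbitrary choice made only for illustration:

```python
def harmonics(f0_hz, max_hz):
    """Integral multiples of the fundamental that do not exceed max_hz."""
    return [f0_hz * k for k in range(1, int(max_hz // f0_hz) + 1)]


for f0 in (104, 140):
    series = harmonics(f0, 1000)
    # Adjacent harmonics are always separated by exactly f0.
    print(f0, "Hz:", len(series), "harmonics up to 1000 Hz:", series)
```

Because adjacent harmonics are separated by the fundamental itself, more of them fit into any fixed frequency range at 104 Hz than at 140 Hz.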

The assignment of levels on each dimension to response buttons was 104 Hz- 
button 1, 140 Hz-button 2, /bae/-button 1, and /gae/-button 2. This assignment 
was the same for the control and orthogonal conditions for both dimensions. 

Results and Discussion 

Reaction time. Mean RTs for the Place and Pitch dimensions are shown in 
Figure 3, for the control and orthogonal conditions. Each point in the display 
is the mean of 1,536 observations over the 12 subjects, with a given subject con- 
tributing 128 observations to each mean. A 2 x 2 display of this kind will com- 
prise the principal RT results in each experiment. 

As described in the introduction, the question of interest in the RT data is 
whether each dimension could be processed without interference from irrelevant 
variation in the other dimension. For Place, Figure 3 shows that there was a 
substantial increase in RT of 50.1 msec from the control to the orthogonal 
condition, while for Pitch the corresponding difference between conditions was 0.6 
msec. The statistical reliability of these results may be determined from the 
analysis of variance for this experiment presented in Table 2. The term in the 
analysis corresponding to the 2x2 partition of the data in Figure 3 is the 
Condition x Dimension interaction (B x C in Table 2), which was highly signifi- 
cant. The main effects of Conditions and Dimensions are also significant, but 
these effects can be completely accounted for by the Condition x Dimension inter- 
action. Individual differences among subjects will be considered in a later 
section. 

The Scheffe procedure for individual comparisons was applied to the 
Condition x Dimension interaction means and showed that a difference of 14.6 msec was 
necessary for significance at the P < .001 level. Thus, the orthogonal condition 
for Place differed significantly from the other three conditions, while the dif- 
ferences among the latter were not significant. The RT results of Experiment 1 
therefore constitute a clear replication of those obtained by Day and Wood 
(1971a). In both experiments there was a significant Condition x Dimension 
interaction, indicating that irrelevant variation in Pitch produced significantly 
more interference with the identification of Place than the reverse. 
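
The asymmetric interference can be summarized numerically from the values reported above. The sketch below only restates those values; the constant and function names are illustrative, and the comparison against the Scheffe critical difference is a simplification of the full procedure.

```python
SCHEFFE_CRITICAL_MSEC = 14.6  # difference required at P < .001 (from the text)

# Orthogonal minus control RT for each target dimension (from the text).
INTERFERENCE_MSEC = {"Place": 50.1, "Pitch": 0.6}


def is_reliable(difference_msec, critical_msec=SCHEFFE_CRITICAL_MSEC):
    """True when a difference between two means meets the critical difference."""
    return abs(difference_msec) >= critical_msec


def asymmetry(interference):
    """How much more the Place task suffered than the Pitch task, in msec."""
    return round(interference["Place"] - interference["Pitch"], 1)


for dimension, diff in INTERFERENCE_MSEC.items():
    print(dimension, diff, "reliable" if is_reliable(diff) else "not reliable")
print("asymmetry:", asymmetry(INTERFERENCE_MSEC), "msec")
```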

Evoked potentials. The evoked potential data recorded during the control 
conditions for Place and Pitch in Experiment 1 are shown in Figure 4. Evoked 
potentials from the two tasks are superimposed at each electrode location to 
facilitate visual comparison. Each trace is the average of 1,536 trials (128 
trials for each of the 12 subjects) and corresponds directly to the RT data from 
the control conditions for Place and Pitch shown in Figure 3. The statistical 
reliability of differences between the evoked potentials for the two dimensions 
was analyzed using the Wilcoxon procedure described in the Method section above. 
The results of this analysis are shown in histogram form directly beneath the 
pair of evoked potentials at each electrode location. Upward deflections from 
baseline in the statistical traces indicate that the obtained difference in 
evoked potentials at that individual time point was significant at the P < .01 
level. 

Figure 3: Reaction time data for Experiment 1. 

TABLE 2: Summary of analysis of variance for Experiment 1. 

Figure 4: Average evoked potentials during identification of Place and Pitch 
in Experiment 1. 

In order to analyze neural activity that occurred during the processing of 
each dimension and eliminate from consideration possible differences in neural 
activity associated with the button-press response, the 490 msec averaging epoch 
was empirically divided into pre-response and motor response intervals. On a 
single trial, the time interval during which perceptual processing must have 
occurred begins at stimulus onset and ends with the subject's button-press re- 
sponse. However, in the large number of trials required for evoked potential 
averaging, the precise termination of this "processing interval" is less clear. 
The criterion employed by Wood et al. (1971) to estimate the end of the "process- 
ing interval" was also used here, namely, the time point after which 99 percent 
of the button-press responses occurred. This point is shown for the data in 
Figure 4 by the vertical lines at 207 msec. It should be noted that this 
criterion is particularly conservative, since it eliminates from consideration activity 
which occurs in time after the first 1 percent of subjects' motor responses. It 
is conceivable, of course, that differences in neural activity related to per- 
ceptual processing might not occur until the middle of the RT distribution. How- 
ever, in such a case it would be impossible to determine whether these differ- 
ences were related to perceptual processing or to the motor response. 
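
The criterion for ending the "processing interval" is, in effect, the first percentile of the pooled RT distribution. A minimal sketch, assuming the pooled RTs are available as a list in msec; the function name and the exact handling of the boundary trial are assumptions, since the text does not specify them.

```python
def pre_response_cutoff(rts_msec, percent_before=1):
    """Return the RT with at most percent_before percent of responses earlier,
    so that roughly 99 percent of responses occur at or after the cutoff."""
    ordered = sorted(rts_msec)
    k = (percent_before * len(ordered)) // 100  # number of faster responses allowed
    return ordered[k]


# Illustrative data only: 100 RTs spaced 1 msec apart.
example_rts = list(range(200, 300))
cutoff = pre_response_cutoff(example_rts)
share_after = sum(rt >= cutoff for rt in example_rts) / len(example_rts)
print(cutoff, share_after)
```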

As described in the introduction, factors such as the acoustic stimuli, 
stimulus presentation probability, subjects' motor response and RT, electrode 
location, and all aspects of the recording apparatus were equated in the evoked 
potentials from the Place and Pitch tasks. Therefore, if no true difference in 
neural activity were produced by the Place and Pitch tasks, then the two evoked 
potentials at each location in Figure 4 would merely be random samples from the 
same population. Under these conditions, the evoked potentials from the Place 
and Pitch tasks should differ only to the extent expected by chance alone. For 
the right hemisphere locations (T4 and C4) shown on the right side of Figure 4, 
this was indeed the case. No more significant differences than would be expected 
by chance occurred at either location. In contrast, at corresponding locations 
over the left hemisphere (T3 and C3), there were significant differences in 
evoked potentials during the pre-response interval. While 2.56 points 
significant at the P < .01 level would be expected to occur by chance, 48 and 54 
significant points were obtained at T3 and C3, respectively. These results provide 
a clear replication of the evoked potential results of Wood et al. (1971). In 
both experiments, significant differences between evoked potentials for Place and 
Pitch were obtained only at left-hemisphere locations. 
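
The chance expectation of 2.56 significant points is simply the number of time points multiplied by the significance level. The sketch below restates that arithmetic with the counts reported above; the names are illustrative.

```python
N_TIME_POINTS = 256  # time points per evoked potential
ALPHA = 0.01         # per-point significance level

OBSERVED_POINTS = {"T3": 48, "C3": 54}  # left-hemisphere counts from the text


def expected_chance_points(n_points, alpha):
    """Number of points expected to reach significance if no true difference exists."""
    return n_points * alpha


expected = expected_chance_points(N_TIME_POINTS, ALPHA)
for location, observed in OBSERVED_POINTS.items():
    print(location, observed, "observed vs", round(expected, 2), "expected;",
          "ratio", round(observed / expected, 1))
```

Adjacent time points in an evoked potential are not independent, so the ratio is a descriptive index rather than a formal test.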



EXPERIMENT 2: INTENSITY 



The conclusions of Experiment 1 and the initial experiments would be strength- 
ened by showing that the patterns of RT and evoked potential results for the Place 
and Pitch dimensions do not occur when neither dimension requires phonetic 
processing. That is, if the results of these experiments were actually due to 
different levels of processing required for the auditory and phonetic dimensions, 
then similar results should not occur when two auditory dimensions are compared. 
Experiment 2 provided such a comparison for the auditory dimensions Pitch and 
Intensity. All conditions of this experiment were identical to Experiment 1, 
except that Intensity was substituted for Place as the second target dimension. 

Stimuli 

Two of the stimuli for this experiment were the syllables /bae/-104 Hz and 
/bae/-140 Hz shown on the left side of Figures 1 and 2. Thus, the Pitch 
dimension in this experiment (104 Hz versus 140 Hz) was identical to that of 
Experiment 1. However, instead of varying in Place, the stimuli varied in 
Intensity, which was produced by attenuating the stimuli /bae/-104 Hz and 
/bae/-140 Hz by 20 db SL. 
Thus, the four stimuli used in this experiment were 104 Hz-loud, 104 Hz-soft, 
140 Hz-loud, and 140 Hz-soft, each with formant patterns identical to those of the 
syllable /bae/ used in Experiment 1. The 20 db SL attenuation was produced 
relative to each subject's individual 65 db SL signal level by interposing a Hewlett 
Packard Model 350B decade attenuator in the stimulus circuit under LINC relay 
control. The assignment of response buttons to levels on each dimension was 
similar to Experiment 1: 104 Hz-button 1, 140 Hz-button 2, loud-button 1, and 
soft-button 2. 
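
For readers less used to decibel arithmetic, a 20 db attenuation corresponds to a tenfold reduction in signal amplitude. The sketch below applies the standard decibel definitions; the resulting 45 db level for the soft stimuli is an inference from the figures in the text rather than a value the text states.

```python
def db_to_amplitude_ratio(db):
    """Amplitude (voltage or pressure) ratio for a level change in decibels."""
    return 10 ** (db / 20)


def db_to_power_ratio(db):
    """Power (intensity) ratio for a level change in decibels."""
    return 10 ** (db / 10)


LOUD_LEVEL_DB = 65   # each subject's signal level, from the text
ATTENUATION_DB = 20  # attenuation for the soft stimuli, from the text

soft_level_db = LOUD_LEVEL_DB - ATTENUATION_DB
print("soft level:", soft_level_db, "db")
print("amplitude ratio:", db_to_amplitude_ratio(-ATTENUATION_DB))
print("power ratio:", db_to_power_ratio(-ATTENUATION_DB))
```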

Results and Discussion 

Reaction time. Mean RTs for Pitch and Intensity are shown in Figure 5. 
There was a large increase in RT from the control to the orthogonal condition for 
both dimensions. For Pitch the increase was 42.7 msec and for Intensity the 
increase was 36.1 msec. The results of the analysis of variance shown in Table 3 
indicate that the difference between control and orthogonal conditions was highly 
significant, while there was no difference between the Pitch and Intensity 
dimensions and no Condition x Dimension interaction. The Scheffe procedure for 
individual comparisons showed that a difference of 15.3 msec was required for 
significance at the P < .001 level. These results demonstrate that irrelevant 
variation in Pitch and Intensity each produced substantial interference with the 
identification of the other. This pattern of results is identical to that obtained 
for "integral" pairs of stimulus dimensions by Garner and Felfoldy (1970). Such 
mutual interference suggests that both dimensions are automatically processed on 
each trial, regardless of which dimension subjects are required to identify by 
the processing tasks. 

Evoked potentials. The evoked potentials recorded during the control 
conditions of Experiment 2 are shown in Figure 6. In this experiment, the time point 
which divides the pre-response and motor response intervals was 209 msec, as 
shown by the vertical lines in Figure 6. The evoked potentials for Pitch and 
Intensity were not significantly different at any electrode location. These data 
are therefore consistent with the RT results presented above in that neither re- 
sponse measure provides evidence that different levels of processing are required 
for the two auditory dimensions. In isolation, these "negative" results could 
have been produced by measurement insensitivity. However, in the context of the 
significant differences in both response measures obtained in Experiment 1, one 
may be confident that the results of this experiment were not due to imprecise 
measurement. 

TABLE 3: Summary of analysis of variance for Experiment 2. 

Figure 6: Average evoked potentials during identification of Intensity and 
Pitch in Experiment 2. 



EXPERIMENT 3: SECOND FORMANT TRANSITIONS 



As shown in the spectrograms of the stimuli in Experiment 1 (Figures 1 and 
2), the F2 transition was the only acoustic difference between the syllables 
/bae/ and /gae/ at each level of the Pitch dimension. Since these pairs of stim- 
uli were acoustically identical in all other respects, the F2 transitions (or 
some portion thereof) must have been the acoustic basis for the discrimination 
between /bae/ and /gae/ in the Place task. Therefore, the phonetic level of pro- 
cessing must also be directly related to the F2 transitions. 

Two alternative modes of processing could characterize the additional level 
of processing for Place demonstrated for Experiment 1: an auditory mode and a 
phonetic mode. According to the auditory mode, what has been termed the "phonetic" 
level would actually be an auditory process specialized to detect particular 
acoustic cues in the speech signal. The F2 transition is one acoustic cue that 
such an auditory process would undoubtedly be specialized to detect, since this 
cue "is probably the single most important carrier of linguistic information in 
the speech signal" (Liberman et al., 1967). From this point of view, the differ- 
ences between Place and Pitch in Experiment 1 would be attributed to different 
processing requirements for the F2 transitions: to be correct in the Place task 
subjects had to process the F2 transitions, while they could have performed the 
Pitch task correctly without processing the F2 transitions. 

Alternatively, the phonetic mode would suggest that the additional level of 
processing would be specialized for the extraction of phonetic features rather 
than the detection of particular acoustic events in the speech signal. Instead 
of attributing the results of Experiment 1 to the requirement for processing of 
the F2 transitions per se, the phonetic mode would suggest that the important 
difference between Place and Pitch in Experiment 1 was that the F2 transitions 
occurred in phonetic context and cued a phonetic distinction. 

Experiment 3 sought to distinguish empirically between the auditory and 
phonetic modes described in the previous paragraphs. By isolating the F2 
transitions from the syllable context of the stimuli in Experiment 1, it was possible to 
require identification of the same F2 transitions and the same levels on the 
Pitch dimension, while eliminating the phonetic distinction normally cued by the 
F2 transitions. Thus, this experiment investigated whether the differences 
between Place and Pitch in Experiment 1 were due to the processing of the F2 
transitions per se, or rather to the fact that they cued a phonetic distinction. 
If the results of Experiment 1 were produced by the F2 transitions per se, then 
identical results should be obtained in the present experiment. 

Stimuli 

The four stimuli for Experiment 3 were the isolated F2 transitions of the 
CV syllables in Experiment 1, shown in Figures 1 and 2. These stimuli were con- 
structed by eliminating from the CV syllables all portions of the stimuli which 
were identical for /bae/ and /gae/. Figure 7 presents an example of this pro- 
cess. The top of Figure 7 shows a wide-band spectrogram of the syllable /bae/- 
104 Hz, identical to that shown in the upper left of Figure 1. The corresponding 
F2 transition for this syllable used in Experiment 3 is shown in the bottom of 
Figure 7. Note that all of F1 and the steady-state vowel portion of F2 have been 
eliminated, leaving only the F2 transition. Thus, precisely the same acoustic 
information was available for distinguishing between /b/ and /g/ in this 
experiment and Experiment 1, the difference being the presence or absence of redundant 
phonetic context. Phenomenologically, isolated F2 transitions are perceived as 
nonspeech "chirps" (Liberman et al., 1967; Mattingly et al., 1971). The 
resulting set of four stimuli was composed of two levels on the Pitch dimension (104 Hz 
and 140 Hz) and two F2 transitions (rising, extracted from the syllable /bae/; 
and falling, extracted from the syllable /gae/). 

Figure 7: An example of the relationship between the CV syllables used in 
Experiment 1 and the F2 transitions used in Experiment 3. 

The assignment of response buttons to stimuli in this experiment was: 
104 Hz-button 1, 140 Hz-button 2, rising F2 transition-button 1, and falling F2 
transition-button 2. Note that for the Pitch dimension this assignment is iden- 
tical to that in Experiment 1, and that each F2 transition was assigned the same 
response as the corresponding syllable in Experiment 1. To avoid biasing the way 
subjects perceived the isolated F2 transitions, they were not told how the stim- 
uli related to those of Experiment 1, and the neutral label "quality" was used to 
refer to the variation in the F2 transitions. Subjects were therefore free to 
distinguish between rising and falling F2 transitions in any manner they wished. 

Results and Discussion 

Reaction time. Mean RTs for the identification of Pitch and the F2 
transitions are shown in Figure 8. For Pitch, there was an increase of 65.7 msec from 
the control to the orthogonal condition, while for the F2 transitions there was 
an increase of 53.8 msec. The analysis of variance (Table 4) and the Scheffe 
procedure showed that both differences were significant. The main effect of 
Conditions was highly significant, while the main effect of Dimensions and the 
Condition x Dimension interaction did not reach significance. The results of the 
Scheffe procedure showed that a difference of 19.8 msec between any pair of means 
in Figure 8 was necessary for significance at the P < .001 level. Thus, there 
was no difference between Pitch and the F2 transitions in either condition, and 
all of the variance in the 2x2 partition of the data shown in Figure 8 was 
associated with the difference between control and orthogonal conditions. 

Evoked potentials. The evoked potentials recorded during the 
identification of the F2 transitions and Pitch are shown in Figure 9. In this experiment 
the pre-response and motor response intervals were divided at 199 msec. As shown 
by the Wilcoxon traces, there were no more significant differences than would be 
expected by chance alone at any electrode location. 

Thus, neither the RT nor the evoked potential results for the F2 transitions 
in this experiment duplicated those of the Place dimension in Experiment 1. 
These results strongly suggest that the same F2 transitions were perceived dif- 
ferently in isolation as opposed to when they occurred in syllable context and 
cued a phonetic distinction. Moreover, both the RT and evoked potential results 
in the present experiment correspond exactly to those in Experiment 2 for Pitch 
and for another auditory dimension, Intensity. Therefore, the perception of the 
F2 transitions in isolation corresponded much more closely to the perception of 
Intensity than to the perception of the same F2 transitions when they cued a 
phonetic distinction. 

Figure 8: Reaction time data for Experiment 3. 

TABLE 4: Summary of analysis of variance for Experiment 3. 

Figure 9: Averaged evoked potentials during identification of F2 transitions 
and Pitch in Experiment 3. 



EXPERIMENT 4: PITCH CONTOUR 



The F2 transitions which cued the phonetic distinction between /bae/ and 
/gae/ in Experiment 1 are examples of cues which are said to be highly "encoded" 
or "restructured." These terms refer to acoustic segments in the speech signal 
which transmit information in parallel about multiple phonetic segments, and 
which therefore undergo considerable context-conditioned variation as a function 
of their phonetic environment (cf. Liberman et al., 1967; Studdert-Kennedy and 
Shankweiler, 1970; Haggard, 1971; Darwin, 1971a). In contrast to such highly 
"encoded" cues, there are other acoustic parameters which carry linguistic in- 
formation but which undergo much less context-conditioned variation. For these 
cues there is more or less a one-to-one correspondence between a given acoustic 
parameter and the linguistic distinction cued by that parameter. At the phonetic 
level, examples of such "unencoded" cues are the frequency positions of formants 
as cues for isolated vowels (Delattre, Liberman, Cooper, and Gerstman, 1952; 
Peterson and Barney, 1952), and the frequency position of friction noises as cues 
for certain fricatives (Hughes and Halle, 1956; Harris, 1958). 

Another relatively "unencoded" linguistic cue is the direction of change in 
fundamental frequency of terminal portions of an utterance. Terminal change in 
fundamental frequency (here called Pitch Contour) is the most important single 
cue for judging whether an utterance was a question or statement (Lieberman, 
1967, in press; Fry, 1968; Lehiste, 1970; Studdert-Kennedy and Hadding, 1971). 
In contrast to the "unencoded" cues at the phonetic level described above, Pitch 
Contour can occur over longer durations than a single syllable and the perceived 
direction of terminal Pitch Contour can be influenced by the Pitch level of 
earlier syllables in the utterance (Hadding-Koch and Studdert-Kennedy, 1964; 
Studdert-Kennedy and Hadding, 1971). 

Experiment 4 investigated whether the additional level of processing demon- 
strated by the RT and evoked potential data of Experiment 1 was required for the 
identification of rising and falling Pitch Contour. Since a given change in 
Pitch Contour is judged in a similar manner both when it is carried by a speech 
signal and a pure tone (Studdert-Kennedy and Hadding, 1971), this experiment in- 
vestigated an acoustic parameter which conveys linguistic information in an 
"unencoded" auditory form rather than the highly "encoded" form represented by 
the cues for Place in Experiment 1. In addition, Pitch Contour was also selected 
for this experiment in order to evaluate the use of the Pitch dimension as the 
baseline auditory dimension in Experiments 1-3 and the initial experiments. It 
is possible that the use of Pitch as a nonlinguistic dimension may be 
inappropriate, since changes in Pitch (i.e., Pitch Contour) can convey linguistic 
information. This experiment therefore provided a basis for comparing the results for 
Pitch and Pitch Contour with dimensions that are clearly linguistic (Place in 
Experiment 1) and nonlinguistic (Intensity in Experiment 2). 

Stimuli 

Narrow-band spectrograms of the stimuli for Experiment 4 are shown in 
Figure 10. The two stimuli on the left side of this figure are identical to the 
syllables /bae/-104 Hz and /bae/-140 Hz used in Experiment 1 and shown on the 
left sides of Figures 1 and 2. As indicated by the falling harmonics in the 
spectrograms, these two stimuli had a Pitch Contour which gradually fell over the 
300 msec duration of the stimuli. The right half of Figure 10 shows the remaining 
two stimuli for this experiment. These stimuli are identical to those on the 
left in all respects, except for their Pitch Contour. In the stimuli on the 
right the Pitch Contour is gradually rising instead of falling. Thus, the 
stimuli for Experiment 4 varied in Pitch (104 Hz versus 140 Hz) and Pitch Contour 
(falling contour, the cue for statement; and rising contour, the cue for 
question). Assignments of levels on the two dimensions to response buttons in this 
experiment were: 104 Hz-button 1, 140 Hz-button 2, statement-button 1, and 
question-button 2. 

Figure 10: Narrow-band spectrograms of the four stimuli used in Experiment 4. 

Results and Discussion 

Reaction time. Mean RTs for the identification of Pitch and Pitch Contour 
are shown in Figure 11. There was a large increase in RT from the control to 
orthogonal conditions for both dimensions: 101.5 msec for Pitch and 101.1 msec 
for Pitch Contour. In the analysis of variance shown in Table 5, the main effect 
of Conditions was highly significant, while there was no significant Condition x 
Dimension interaction. These results suggest that Pitch and Pitch Contour 
interfere mutually, in a manner similar to Pitch and Intensity in Experiment 2. 
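
The F ratios in Table 5 can be checked from the tabled mean squares. The sketch below assumes, on the basis of the numbers alone, that each effect was tested against its interaction with the Within factor (D); the text does not state the error terms explicitly, and the dictionary layout is illustrative.

```python
# effect: (MS for the effect, MS for effect x D, reported F), all from Table 5.
TABLE_5 = {
    "Subjects (A)":   (1657009.45, 9101.24, 182.06),
    "Conditions (B)": (15752017.25, 26799.38, 587.78),
    "Dimensions (C)": (9080791.63, 15892.06, 571.40),
    "A x B":          (115329.08, 8481.84, 13.59),
    "A x C":          (155983.69, 6834.86, 22.82),
    "B x C":          (270.13, 14038.41, 0.02),
    "A x B x C":      (53567.78, 9007.67, 5.95),
}


def f_ratio(ms_effect, ms_error):
    """F is the effect mean square divided by its error mean square."""
    return ms_effect / ms_error


for effect, (ms_effect, ms_error, reported) in TABLE_5.items():
    computed = f_ratio(ms_effect, ms_error)
    print(f"{effect:15s} computed F = {computed:8.3f}   reported F = {reported}")
```

Under this assumption every computed ratio agrees with the tabled value to within 0.01, and the negligible B x C ratio corresponds to the absence of a Condition x Dimension interaction.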

However, in contrast to the previous experiments, Figure 11 shows that there 
was a large difference in RT between Pitch and Pitch Contour in the control 
condition as well. The statistical reliability of this difference is indicated by the 
significant main effect of Dimensions in the absence of a Condition x Dimension 
interaction, and by the value of 17.8 msec required for significance at the P < .001 
level according to the Scheffe procedure. This main effect difference between 
Pitch and Pitch Contour demonstrates the necessity for obtaining data from both 
the orthogonal and control conditions for unambiguous interpretation of 
differences in RT between a given pair of dimensions. The results of the orthogonal 
condition in this experiment are equivalent to those of Experiment 1 for Place 
and Pitch, with Pitch significantly faster than the other dimension in both 
experiments. However, a comparable difference between Pitch and Pitch Contour 
in the control condition indicates that the results of the orthogonal condition 
do not reflect differential interference between dimensions, but rather that it 
took approximately 80 msec longer in both conditions to identify the direction 
of Pitch change than the absolute level of the Pitch itself. 

Evoked potentials. The evoked potential data recorded during the 
identification of Pitch and Pitch Contour are shown in Figure 12. In this experiment 
the pre-response and the motor response intervals were divided at 204 msec. In 
the pre-response interval there were no more significant differences between the 
evoked potentials for Pitch and Pitch Contour than would be expected by chance 
at any location. Thus the evoked potential differences characteristic of Place 
and Pitch in Experiment 1 did not occur in this experiment. 

However, in the motor response interval there were significant differences 
between dimensions at all four electrode locations. This pattern of results
provides an excellent illustration of the value in distinguishing between
pre-response and motor response intervals in the evoked potential analysis. In
the absence of such a distinction, differences such as those in Figure 12 might be
attributed to differences in the perceptual processing of the two dimensions, 
in a manner similar to that in Experiment 1 and the experiment of Wood et al. 
(1971). Clearly, however, there is an alternative explanation for differences 
during the motor response interval. Since the evoked potential differences be- 
tween Pitch and Pitch Contour did not occur until after subjects began to make 

Figure 11: Reaction time data for Experiment 4.

TABLE 5: Summary of analysis of variance for Experiment 4.

SOURCE                  df               MS          F

Subjects (A)            11       1657009.45     182.06*
A X D                 1397          9101.24
Conditions (B)           1      15752017.25     587.78*
B X D                  127         26799.38
Dimensions (C)           1       9080791.63     571.40*
C X D                  127         15892.06
Within (D)             127         33872.06
A X B                   11        115329.08      13.59*
A X B X D             1397          8481.84
A X C                   11        155983.69      22.82*
A X C X D             1397          6834.86
B X C                    1           270.13        .02
B X C X D              127         14038.41
A X B X C               11         53567.78       5.95*
A X B X C X D         1397          9007.67
TOTAL                 6143

*P < .001

Figure 12: Average evoked potentials during identification of Pitch Contour
and Pitch in Experiment 4.

their button-press responses, these could reflect differences between dimensions
either in perceptual processes or in RT, or both. This ambiguity is precisely 
the reason that Wood et al. (1971) noted that evoked potential differences could 
not be attributed to perceptual variables in cases where RT differences were also 
obtained.

Evidence from other experiments suggests that the differences between Pitch 
and Pitch Contour in Figure 12 may be related to differences in RT. Using a 
simple RT task, Bostock and Jarvis (1970) obtained averaged evoked potentials 
separately for the fastest, middle, and slowest thirds of the RT distribution. 
The resulting evoked potentials showed significant differences as a function of 
RT, and the form of the differences closely paralleled those shown in Figure 12. 
Evoked potentials from the shortest RT trials were positive in polarity relative 
to those from longer RT trials, in the same 200-400 msec latency range as the 
differences in Figure 12. Thus, differences very similar to those between Pitch 
and Pitch Contour in Figure 12 can be obtained as a function of RT differences 
alone, without differences in the perceptual task. 
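
The partitioning described above can be sketched as follows. This is only an illustration of the general procedure (sorting trials by RT, dividing them into thirds, and averaging the waveforms within each third); the function names and the trial format are hypothetical and are not taken from Bostock and Jarvis (1970).

```python
def average_waveforms(waveforms):
    """Point-by-point mean of a list of equal-length waveforms."""
    n = len(waveforms)
    return [sum(points) / n for points in zip(*waveforms)]


def average_by_rt_thirds(trials):
    """Average evoked potentials within RT thirds.

    trials: list of (rt_msec, waveform) pairs, at least three of them.
    Returns [fastest, middle, slowest] averaged waveforms.
    """
    ordered = sorted(trials, key=lambda trial: trial[0])
    n = len(ordered)
    cuts = [0, n // 3, 2 * n // 3, n]
    return [
        average_waveforms([wave for _, wave in ordered[cuts[i]:cuts[i + 1]]])
        for i in range(3)
    ]


# Hypothetical two-point "waveforms", used only to show the bookkeeping.
toy_trials = [(300, [1, 1]), (250, [3, 3]), (400, [0, 0]),
              (420, [2, 2]), (310, [5, 5]), (500, [4, 4])]
fast, middle, slow = average_by_rt_thirds(toy_trials)
print(fast, middle, slow)
```

Comparing the three averages point by point would then show whether the waveform varies with RT alone, which is the comparison relevant to Figure 12.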

In summary, neither perceptual nor motor response variables can be eliminated
as possible sources of the differences between Pitch and Pitch Contour in 
Figure 12. However, it is likely that motor response and RT factors were in- 
volved. In any case, regardless of the source of the evoked potential differ- 
ences in this experiment, they do not parallel those in the pre-response interval 
between Place and Pitch in Experiment 1. Therefore, the RT and evoked potential 
data from Experiment 4 are consistent in suggesting that the identification of 
Pitch Contour does not require the additional level of processing required for 
the Place dimension in Experiment 1. 



ADDITIONAL ANALYSES OF NEURAL ACTIVITY DURING THE 
PROCESSING OF AUDITORY AND PHONETIC DIMENSIONS 



The technique of signal averaging used to obtain the evoked potential data 
presented above is only one of a number of methods for investigating possible 
differences in neural activity between two processing tasks. This particular 
technique was selected for the initial evoked potential experiment (Wood et al., 
1971) because it is designed to increase the signal-to-noise ratio of activity
synchronized to the onset of the acoustic stimuli relative to the "noise" of the 
background EEG. For general discussions of signal averaging in relation to 
evoked potentials, see Geisler (1960), Ruchkin (1965), Vaughan (1966), Perry and 
Childers (1969), and Regan (1972). 
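
The principle can be illustrated with a small simulation. The sketch below is not the averaging program used in these experiments; it assumes a hypothetical stimulus-locked waveform (one sine cycle) buried in independent Gaussian "background" noise, and shows that the error of the average shrinks as more trials are included.

```python
import math
import random


def average_trials(trials):
    """Point-by-point mean across trials (the signal average)."""
    n = len(trials)
    return [sum(points) / n for points in zip(*trials)]


def rms_error(estimate, truth):
    """Root-mean-square difference between two waveforms."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)
    )


def simulate(n_trials, n_points=100, noise_sd=10.0, seed=0):
    """RMS error of the average of n_trials noisy copies of one waveform."""
    rng = random.Random(seed)
    # Hypothetical evoked waveform: a single sine cycle of unit amplitude.
    signal = [math.sin(2 * math.pi * i / n_points) for i in range(n_points)]
    trials = [[s + rng.gauss(0.0, noise_sd) for s in signal]
              for _ in range(n_trials)]
    return rms_error(average_trials(trials), signal)


print(simulate(1), simulate(64))
```

With independent noise the residual error is expected to fall roughly as the square root of the number of trials, so an average of 64 trials should be about eight times cleaner than a single trial.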

However, it is possible that the averaging procedure is neither the most 
sensitive nor the most appropriate method for the analysis of differences in 
neural activity between tasks. This possibility may be related to any or all of 
the following consequences of the signal averaging procedure as used in the pre- 
sent investigation: 1) it automatically eliminates consideration of the back- 
ground EEG as a measure of possible differences in neural activity between tasks; 
2) it precludes analysis of intervals in time other than the 490 msec sampling 
epoch immediately following stimulus onset; and 3) by synchronizing on the onset 
of the acoustic stimuli, this procedure implicitly assumes that the neural events 
of interest are in fact synchronized to stimulus onset. The fact that the
averaging procedure actually resolved differences between the Place and Pitch
tasks in Experiment 1 and the experiment of Wood et al. (1971) is evidence of its 
sensitivity and appropriateness under these conditions. However, the success of 
the averaging procedure does not eliminate the possibility that other methods 
might be equally or more successful. 

In addition to the possibility that differences in neural activity between 
the Place and Pitch tasks might be reflected in measures other than the averaged 
evoked potential, a more serious possibility should be considered. The obtained 
differences in evoked potentials could have been indirect effects of differences
in other forms of neural activity between the Place and Pitch tasks. A brief
example will serve to illustrate this possibility. 

A number of experiments have reported that evoked potential amplitude is en- 
hanced under conditions related to psychological variables such as "attention," 
"stimulus significance," "task relevance," etc. This enhancement effect has been 
predominantly observed in a component of the evoked potential with positive po- 
larity and a latency of approximately 300 msec, and has therefore been referred 
to as the "P300" effect (see discussions by Hillyard, Squires, Bauer, and Lindsay, 
1971; Ritter, Simson, and Vaughan, 1972; Squires, Hillyard, and Lindsay, 1973). 
Donchin and Smith (1970) and Karlin (1970) have pointed out that the "P300" effect 
could in principle be produced by the offset of direct-current potentials (the 
"contingent negative variation" or CNV; Tecce, 1972) which are known to occur in 
time before stimulus onset in these tasks; that is, by activity outside the 
evoked potential averaging epoch. If this suggestion were correct, then the 
"P300" enhancement effect would be an electrical artifact instead of an enhance- 
ment of neural activity related to the perceptual processing of stimulus informa- 
tion. 

In an explicit analysis of this question, Donald and Goff (1971) attempted 
to determine whether the "P300" effect could be completely accounted for by pre- 
stimulus CNV. Their results showed that although evoked potential amplitude 
changes covaried with CNV amplitude as suggested by Donchin and Smith (1970) and 
Karlin (1970), the "P300" enhancement effect was still obtained when amplitude 
differences in the CNV were statistically eliminated. Thus, although part of the 
evoked potential variance was associated with the CNV, a direct evoked potential 
effect was obtained as well. 

The analyses reported in the present section were designed to investigate 
the two possible relations between the evoked potential results of Experiment 1 
and other measures of neural activity: 1) that the evoked potential differences
between Place and Pitch might be indirect effects of differences in other measures 
of neural activity; or 2) that these differences might be direct effects of the 
Place-Pitch manipulation but might also be accompanied by parallel differences in 
other measures of neural activity as well. It should be emphasized that in 
either case the conclusion that different neural activity occurred during the 
identification of Place and Pitch would still be completely valid. The presence 
of the evoked potential differences in Experiment 1 for Place and their absence 
in Experiment 2 for Intensity rule out the possibility that they were associated 
with logical or experimental artifacts. Thus, the present analyses attempted to 
determine whether the differences in neural activity during phonetic and auditory 
processing tasks are evoked potential differences per se, or are also present in 
other measures of neural activity. 

The four analyses in this section were based on the tape recorded EEG for
the same blocks of trials in the control condition used to obtain the evoked
potentials for Place and Pitch of Experiment 1 (Figure 4). This procedure
provides a strong test of the possibility that other changes in neural activity
were associated with the obtained evoked potential differences, since these addi- 
tional analyses were computed from precisely the same raw data as were the evoked 
potentials. Therefore, if differences in the background EEG, for example, were 
associated with the obtained differences in evoked potentials, then such changes 
should be clearly evident in an explicit analysis of the EEG from the same block 
of trials. 

The four analyses described below certainly do not exhaust all possible 
ways of comparing neural activity between the Place and Pitch tasks. However, 
they do represent measures of neural activity that have previously been shown to 
be sensitive to various perceptual variables, and to be possible sources of in- 
direct changes in evoked potentials. The first two analyses considered possible 
differences in the background EEG from which the evoked potentials were extracted 
by signal averaging: 1) an analysis of the component frequencies of the EEG 
using spectral analysis techniques, and 2) an analysis of the amplitude distribu- 
tion of the EEG using goodness-of-fit tests of EEG amplitude histograms. The 
third and fourth analyses, like the evoked potential data presented above, were 
based on signal averaging techniques: 3) averaging over the entire 5-sec inter- 
trial interval between successive stimuli to detect possible baseline differences 
related to the CNV, and 4) synchronizing the averaging process on the subjects'
button-press response instead of stimulus onset. Because of the magnitude of the 
computations required for most of these analyses, only one of the two electrode 
locations over each hemisphere was selected for analysis (C3 and C4).

Spectral Analysis of Background EEG 

The first method of investigating possible differences in background EEG 
between the Place and Pitch tasks was to decompose the EEG into its frequency 
components by the use of spectral analysis techniques. General discussions of 
spectral analysis techniques are given by Blackman and Tukey (1958), Bendat and
Piersol (1966), and Jenkins and Watts (1968), while specific applications to EEG
data are presented by Walter (1963); Walter and Adey (1963); Walter, Rhodes, 
Brown, and Adey (1966); and Dumermuth, Walz, Scollo-Lavizzari, and Kleiner (1972). 
The resulting frequency spectra plot relative energy in the EEG signal as a func- 
tion of frequency. The EEG from the Place and Pitch tasks in Experiment 1 was 
submitted to spectral analysis and the resulting spectra were compared statisti- 
cally to determine whether there were significant differences between tasks 
associated with the significant evoked potential differences presented above. 

Method . For each subject an average frequency spectrum was computed for 
the Place and Pitch tasks at electrode locations over the left (C3) and right 
hemisphere (C4). The EEG recorded on magnetic tape during each session was
played into the analog-to-digital converter of a PDP-12 computer, which
simultaneously digitized the analog EEG signals from both locations at a rate of
102.4 samples per second (9.77 msec per point) and stored the values on digital
magnetic tape. Each block of 64 trials lasted approximately 5.33 minutes,
resulting in 128 256-sample magnetic tape blocks for each block of 64 trials. To
avoid aliasing (Blackman and Tukey, 1958; Bendat and Piersol, 1966; Jenkins and 
Watts, 1968), each channel of EEG was low-pass filtered at 40 Hz before input 
to the analog-to-digital converters by a Krohn-Hite Model 3322 two-channel
variable filter (attenuation of 24 db per octave). Following analog-to-digital
conversion of the EEG for each block of trials, the frequency spectrum for that 
block was computed using a Fast Fourier Transform (Cooley and Tukey, 1965),
implemented with a modified version of programs developed by J. S. Bryan (DECUS
No. L-25). The spectra computed in this manner covered the frequency range from
DC to 40 Hz in 0.2 Hz intervals. Each resulting spectrum was stored on digital 
magnetic tape for later statistical analysis. 
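
The sampling figures given above can be checked against one another. A block of 64 trials at 5 sec per trial lasts 320 sec, and 128 tape blocks of 256 samples hold 32,768 samples, which implies 102.4 samples per second; a 0.2 Hz frequency interval then corresponds to a 512-point (5-sec) transform. The sketch below works through this arithmetic and computes a spectrum for a hypothetical 10 Hz test signal. It uses a direct discrete Fourier transform rather than the Fast Fourier Transform program cited in the text, and the 512-point segment length is an assumption inferred from the 0.2 Hz interval.

```python
import cmath
import math

N_BLOCKS, BLOCK_SAMPLES = 128, 256     # tape blocks per block of trials (text)
DURATION_SEC = 64 * 5.0                # 64 trials, one stimulus every 5 sec
RATE_HZ = N_BLOCKS * BLOCK_SAMPLES / DURATION_SEC   # implied sampling rate

SEGMENT_POINTS = 512                   # assumed transform length
BIN_HZ = RATE_HZ / SEGMENT_POINTS      # frequency interval between bins


def power_spectrum(samples, max_hz=40.0):
    """Direct DFT power estimate for the bins from DC up to max_hz."""
    n = len(samples)
    n_bins = int(round(max_hz / (RATE_HZ / n))) + 1
    spectrum = []
    for k in range(n_bins):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Hypothetical test signal: a 10 Hz sine wave sampled at the implied rate.
segment = [math.sin(2 * math.pi * 10.0 * i / RATE_HZ)
           for i in range(SEGMENT_POINTS)]
spectrum = power_spectrum(segment)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(RATE_HZ, BIN_HZ, len(spectrum), peak_bin * BIN_HZ)
```

The implied rate also places the 40 Hz low-pass cutoff below half the sampling rate (51.2 Hz), as required to avoid aliasing.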

Results . Examples of frequency spectra from a single subject during the 
Place task are shown in Figure 13 for the left hemisphere (C3) and Figure 14 for 
the right hemisphere (C4). The shaded spectrum at the bottom of each figure is 
the average for the Place task for that subject which was entered into the 
across-subject statistical analysis. Above these averages in each figure are 
three-dimensional plots of individual spectra computed over successive 20-sec
epochs during the two blocks of trials in the Place task. For purposes of visual 
presentation, the individual spectra were smoothed prior to plotting by a three- 
point smoothing algorithm. However, the average spectra shown at the bottom of 
each figure were not smoothed prior to plotting. Therefore, the spectra for each 
subject entered into the statistical analysis were the actual computed spectra 
without transformations of any kind. In Figures 13 and 14, spectra from the 
first block of Place trials (the initial 5.33 min in each figure) are followed 
immediately by those from the second block, even though these blocks did not 
follow each other directly during the experiment. The data are presented in this 
way to illustrate the overall stability of the frequency composition of the EEG 
signal during the two blocks of trials for each task, and to show that the aver- 
age spectrum over the two blocks does no injustice to any individual spectral 
estimate within either block. 

The across-subject averages for the Place and Pitch tasks are shown in
Figure 15, for the left (C3) and right (C4) hemisphere locations. These data are 
directly analogous to the evoked potential data for C3 and C4 in the Place and 
Pitch tasks presented in Figure 4 above. In contrast to the significant differ- 
ences in evoked potentials which occurred on these same blocks of trials, there 
were no significant differences in the EEG spectra at either location. Visual 
inspection indicates an almost complete overlap between the spectra for the Place 
and Pitch tasks throughout the frequency range. This conclusion was substanti- 
ated by the results of Wilcoxon tests, identical to those described above for the 
evoked potential data, which were computed at each 0.2 Hz frequency interval in 
the spectra. These results indicate that there were no significant differences 
in the background EEG between the Place and Pitch tasks, either in overall 
energy level, or in the energy within any specific frequency band. Thus, these 
data eliminate the possibility that the evoked potential differences between
Place and Pitch in Experiment 1 could have been indirect effects of generalized
differences in background EEG between tasks (cf. discussions by Broughton, 1969;
and Regan, 1972).
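
A point-by-point comparison of this kind can be sketched as follows, assuming that the Wilcoxon tests were of the matched-pairs signed-ranks form, with each subject contributing one Place value and one Pitch value at each frequency interval. The function names are illustrative; the p-value is obtained by enumerating all sign assignments, which is practical for 12 subjects.

```python
from itertools import product


def signed_rank_p(x, y):
    """Exact two-sided p-value for the Wilcoxon matched-pairs signed-ranks test."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # zero differences dropped
    if not diffs:
        return 1.0
    ordered = sorted(abs(d) for d in diffs)

    def rank(value):
        # Average rank of a value among the absolute differences (handles ties).
        first = ordered.index(value) + 1
        return first + (ordered.count(value) - 1) / 2

    ranks = [rank(abs(d)) for d in diffs]
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Under the null hypothesis every assignment of signs is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** len(ranks)


def pointwise_tests(place, pitch):
    """place[i][s], pitch[i][s]: value at point i for subject s; one p per point."""
    return [signed_rank_p(a, b) for a, b in zip(place, pitch)]
```

For example, `pointwise_tests` applied to two lists of per-subject spectral values returns one p-value per frequency interval, and the number of intervals falling below the chosen significance level can then be compared with the number expected by chance.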

Gaussian Characteristics of Background EEG: Analysis of EEG Amplitude Histograms 

The results of the spectral analyses described in the previous section do 
not rule out the possibility that the evoked potential differences might have
been accompanied by differences in other measures of background EEG. The present 
section considers an additional measure, the proportion of time during the Place 
and Pitch tasks that the EEG was distributed according to a Gaussian (normal) 
distribution. 

Figure 13: Frequency spectra from the left hemisphere (C3) during identification
of Place.

Figure 14: Frequency spectra from the right hemisphere (C4) which occurred
simultaneously with those shown in Figure 13.

Figure 15: Average frequency spectra across subjects during identification of
Place and Pitch.



This analysis was based on a series of investigations by Elul (1967, 1968, 
1969), which suggested that inferences about cooperative interactions among indiv- 
idual neurons could be made from the amplitude distribution of the gross EEG. 
The logic of this inference rests upon the statistical relationship between the 
gross EEG recorded from the cortical surface or the scalp, and slow-wave activity 
recorded intra- or extracellularly from single cortical neurons (Elul, 1967, 1968,
1969). In both the cat (Elul, 1967, 1968) and the human (Saunders, 1963; Elul, 
1969), the gross EEG is normally distributed, provided suitable epoch lengths are 
used to compute the distributions. In contrast, the distributions of slow-wave 
activity recorded from individual neurons are clearly not normal (Elul, 1967, 
1968). Elul suggested that this relationship could be accounted for if the
individual potential fields summated according to the statistical central limit
theorem (cf. Cramer, 1955): "the sum of a large number of probability
distributions always tends to assume a normal distribution, regardless of the nature of
the component distributions, provided only that these original distributions are 
independent, or at least nonlinearly related, possess a mean, and a finite stan- 
dard deviation. (Therefore) the EEG may be accounted for as the normally dis- 
tributed output ensuing from combination of the activity of many independent (or 
nonlinearly related) neuronal generators" (Elul, 1968). 

If the above interpretation for Gaussian characteristics of the EEG were 
correct, then situations in which the EEG were not normally distributed would 
indicate changes in the nature of the interaction between the individual genera- 
tors. For example, Elul (1969) computed amplitude distributions of the EEG from 
the same electrode location on the human scalp under two conditions: 1) when the
subject was resting quietly, and 2) when performing a mental arithmetic task.
The EEG in the resting condition was normally distributed approximately 66 percent
of the time, while during the mental arithmetic task the percentage decreased to
32 percent. Based on the interpretation of EEG distributions outlined above,
Elul (1969) argued that the differences between these proportions reflected an
"increase in the cooperative activity of cortical neuronal elements during
performance of a mental task."

The analysis of EEG amplitude histograms to be described below was a
straightforward adaptation of the experiment of Elul (1969) in order to investi- 
gate possible hemisphere differences in the background EEG during the Place and 
Pitch tasks. It should be noted that while there is strong evidence that
postsynaptic slow-wave activity is the major neuronal source of the surface EEG (cf.
Jasper and Stefanis, 1965; Creutzfeldt, Watanabe, and Lux, 1966a, 1966b; 
Creutzfeldt, 1970; Pollen, 1970), the inference from these data that Gaussian and 
non-Gaussian EEG distributions indicate the degree of "cooperative activity of 
cortical neuronal elements" remains largely unverified by empirical data. How- 
ever, since empirical differences in EEG distributions for Place and Pitch would 
be important independent of Elul's interpretation, questions about the validity 
of the interpretation do not decrease the validity of the empirical comparison. 

Method . The same digitized EEG from the Place and Pitch tasks entered into 
the spectral analyses in the previous section was used to compute the EEG
amplitude histograms. For each subject, goodness-of-fit tests to a Gaussian
distribution were computed on EEG segments of 2.5-sec duration throughout each block of
trials in the Place and Pitch tasks. Measures of skewness and kurtosis were
computed on each 2.5-sec segment of EEG, resulting in 256 tests for skewness and
256 for kurtosis on each electrode location in each task. The null hypothesis of 
a Gaussian distribution was rejected whenever P < .05 on either measure.
Skewness was evaluated using the Pearson $B_1$ statistic, where $B_1 = \mu_3^2 / \mu_2^3$, and
$\mu_j = \sum_i (x_i - \bar{x})^j / N$. Similarly, kurtosis was evaluated using the Pearson $B_2$
statistic, where $B_2 = \mu_4 / \mu_2^2$, with $\mu_j$ defined as above.
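
The two statistics can be computed directly from the central moments of each EEG segment. The sketch below assumes the standard Pearson definitions (B1 as the squared third moment over the cubed second moment, and B2 as the fourth moment over the squared second moment); the printed formulas are partly illegible, so the exact form used in the original analysis is an assumption. For a Gaussian distribution these quantities take the values 0 and 3, respectively. The function names are illustrative, and the significance test applied to each value is not shown.

```python
def central_moment(segment, order):
    """Central moment of the given order: sum((x - mean) ** order) / N."""
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** order for x in segment) / len(segment)


def pearson_b1(segment):
    """Skewness measure: squared third moment over cubed second moment."""
    return central_moment(segment, 3) ** 2 / central_moment(segment, 2) ** 3


def pearson_b2(segment):
    """Kurtosis measure: fourth moment over squared second moment."""
    return central_moment(segment, 4) / central_moment(segment, 2) ** 2


# Hypothetical five-point segment, used only to show the arithmetic.
example = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_b1(example), pearson_b2(example))
```

For the symmetric example segment the second, third, and fourth central moments are 2, 0, and 6.8, giving B1 = 0 and B2 = 1.7.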

Results. The results of the goodness-of-fit tests are shown in Table 6,
which presents for each subject the percentage of the 256 2.5-sec EEG segments
in each task in which the null hypothesis of a Gaussian distribution was rejected.
As shown in the means at the bottom of the table, the Gaussian assumption was
rejected in approximately half the segments in both the Place and Pitch tasks at
both left and right hemisphere locations. The difference between Place and Pitch
tasks failed to reach statistical significance at either the left or right
hemisphere locations (P > .10, Wilcoxon tests). Thus, like the results of the
spectral analyses presented above, these data suggest that the evoked potential
differences between Place and Pitch in Experiment 1 were not associated with
differences in the background EEG.

Signal Averaging of Activity During the Intertrial Interval Between Successive 
Stimuli 

The possibility was raised above that the evoked potential differences be- 
tween Place and Pitch in Experiment 1 could have been associated with prestimulus 
CNV differences between the two tasks. The CNV ("contingent negative variation," 
Walter, Cooper, Aldridge, McCallum, and Winter, 1964) is a prolonged surface-
negative baseline shift in the EEG which, under certain conditions, precedes a
stimulus that requires a response by the subject (see reviews by Cohen, 1969, and 
Tecce, 1972). Typically, CNV experiments have employed one stimulus as a "warn- 
ing stimulus" which is followed by a "task stimulus" after an interval of usually 
2-4 sec. The CNV baseline shift develops during the interval between the "warning"
and "task" stimuli, and has been related to concepts such as "expectancy,"
"anticipation," and "preparation" (see Walter et al., 1964; Cohen, 1969; McAdam,
Knott, and Rebert, 1969; Donald, 1968, 1970; for a discussion of CNV in para- 
digms more familiar to cognitive psychologists, see Posner and Boies, 1971; 
Posner, Klein, Summers, and Buggie, 1973).

However, the narrowly defined "warning stimulus-task stimulus" paradigm 
commonly used to study CNV is not the only situation in which baseline shifts 
similar to the CNV may be obtained. A number of authors have noted that the 
most important prerequisites for obtaining CNV are: 1) that the "task stimulus" 
require some processing or response by the subject; and 2) that the "task stimulus"
be predictable in time from some preceding event (cf. Naatanen, 1967, 1969;
Cohen, 1969; Karlin, 1970; Donchin and Smith, 1970; Donald and Goff, 1971).
Therefore, tasks in which successive stimuli are presented at fixed intertrial 
intervals are, in principle, sufficient for the development of CNV during the 
intertrial intervals. In this situation, each stimulus in the sequence would
serve both as the "task stimulus" requiring a response by the subject, and also 
as the "warning stimulus" for the next stimulus in the sequence (cf. Naatanen, 
1967, 1969; Donchin and Smith, 1970). 

Thus, the identification tasks used in the present investigation could have 
conceivably produced prestimulus baseline shifts in the EEG similar to the CNV. 
In addition, as described in the introduction of this section above, it is possi- 
ble for apparent evoked potential differences to be associated with prestimulus 
differences in CNV (Naatanen, 1967, 1969; Karlin, 1970; Donchin and Smith, 1970; 
Donald and Goff, 1971). Therefore, the present analysis investigated the possi- 
bility that the evoked potential differences between Place and Pitch in Experiment 
1 were associated with differences in CNV between the two tasks. 

TABLE 6: Percentage of EEG segments in which the Gaussian
hypothesis was rejected.

              Left Hemisphere (C3)      Right Hemisphere (C4)

Subject        Place       Pitch         Place       Pitch

   1           52.0        52.3          52.0        55.7
   2           51.2        39.1          49.2        36.5
   3           69.9        69.9          67.9        67.2
   4           46.5        42.4          65.3        55.8
   5           36.3        35.9          35.2        37.7
   6           41.1        42.8          38.5        46.1
   7           45.7        44.9          48.0        50.0
   8           48.9        50.2          49.8        41.4
   9           46.5        67.1          44.5        66.8
  10           67.6        39.2          38.4        66.0
  11           40.4        43.4          44.7        45.3
  12           45.9        40.4          38.1        36.1

 Mean          49.3        47.3          47.6        50.4

Method. The EEG from the control condition for Place and Pitch was aver- 
aged in a manner similar to that used to obtain the evoked potential data pre- 
sented above. However, instead of averaging only 490 msec following each stimu- 
lus, the entire interstimulus interval between successive stimuli was averaged in 
order to observe possible prestimulus baseline shifts. Since the intertrial
interval between stimuli was 5 sec, the sampling epoch for this analysis was made 
to be 6,144 msec, therefore including 2 successive stimulus presentations plus 
the entire intertrial interval. Thus, for a given subject trials 1 and 2 were 
included in the first epoch, trials 3 and 4 in the second epoch, and so on. 
Since each subject received 2 blocks of 64 trials in the control condition for 
each dimension, this procedure resulted in an average of 64 2-stimulus trials in 
the Place and Pitch tasks for each subject. 

The four channels of EEG (T3, C3, C4, and T4) were played from the Honeywell 
tape recorder into the analog-to-digital converter of a PDP-12 computer. The
pulses on a separate tape channel synchronized to stimulus onset were used to 
trigger the averaging process. When a trigger occurred on tape the computer be- 
gan to sample the four EEG channels simultaneously at a rate of 24 msec per point 
for 256 successive points, resulting in epochs of 6,144 msec. The averages were 
stored on magnetic tape and were later analyzed using the Wilcoxon statistical 
procedure used previously for the evoked potential data. 
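
The timing of the two-stimulus epochs follows directly from the figures given above. The sketch below works through the arithmetic; the constant and function names are illustrative only, and the trial numbering simply follows the description in the text (trials 1 and 2 in the first epoch, trials 3 and 4 in the second, and so on).

```python
MSEC_PER_POINT = 24        # sampling interval stated in the text
POINTS_PER_EPOCH = 256     # points sampled after each trigger
INTERTRIAL_MSEC = 5000     # interval between successive stimuli

EPOCH_MSEC = MSEC_PER_POINT * POINTS_PER_EPOCH             # epoch duration
SECOND_STIMULUS_POINT = INTERTRIAL_MSEC / MSEC_PER_POINT   # where stimulus 2 falls
POST_SECOND_STIMULUS_MSEC = EPOCH_MSEC - INTERTRIAL_MSEC   # time left after it


def epoch_pairs(n_trials):
    """Trial numbers (1-based) included in each successive two-stimulus epoch."""
    return [(first, first + 1) for first in range(1, n_trials, 2)]


pairs = epoch_pairs(128)   # 2 blocks of 64 trials per dimension
print(EPOCH_MSEC, len(pairs), pairs[0], pairs[-1])
print(SECOND_STIMULUS_POINT, POST_SECOND_STIMULUS_MSEC)
```

Thus each 6,144-msec epoch contains the first stimulus at its start, the full 5-sec intertrial interval, and 1,144 msec of activity following the second stimulus, and the 128 trials per dimension yield 64 such epochs.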

Results . Figure 16 presents the averaged activity during the intertrial 
intervals of the Place and Pitch tasks at the four electrode locations. Note 
that two stimulus presentations are included in each trace. One presentation 
occurred at the beginning of the traces where sampling was initiated, and the
second presentation occurred near the end of the trace following the 5-sec inter- 
trial interval. The results of the Wilcoxon analyses for these data indicated 
that no more significant points than would be expected by chance alone occurred 
at any electrode location, despite the apparent differences between Place and 
Pitch at left hemisphere locations during the "evoked potential" portions of each 
trace. While 2.56 significant points would be expected to occur in each trace by 
chance alone, 4 and 5 significant points were obtained at T3 and C3, respectively, and
2 and 1 significant points were obtained at C4 and T4. 
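
The chance expectation quoted above can be reproduced with a short calculation. An expectation of 2.56 points in a 256-point trace implies a significance level of .01 per point; this level is inferred from the two figures rather than stated in this section. The sketch below also gives a rough binomial estimate of how often 4 or 5 such points would arise by chance, under the simplifying assumption that the 256 tests are independent. Adjacent points in an averaged trace are correlated, so these probabilities are illustrative only and do not reproduce the criterion used in the original analysis.

```python
from math import comb

N_POINTS = 256
ALPHA = 2.56 / N_POINTS    # per-point significance level implied by the text


def expected_by_chance(n_points, alpha):
    """Expected number of significant points if the null hypothesis holds."""
    return n_points * alpha


def prob_at_least(k, n_points, alpha):
    """P(X >= k) for X ~ Binomial(n_points, alpha), assuming independent tests."""
    below = sum(comb(n_points, j) * alpha ** j * (1 - alpha) ** (n_points - j)
                for j in range(k))
    return 1.0 - below


print(expected_by_chance(N_POINTS, ALPHA))
print(prob_at_least(4, N_POINTS, ALPHA), prob_at_least(5, N_POINTS, ALPHA))
```

On this rough model, neither 4 nor 5 significant points is an unusual outcome, which is consistent with the conclusion that the observed counts do not exceed chance.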

It is interesting to note the time intervals in which the significant points 
occurred, even though they were not sufficient in number for rejection of the
null hypothesis. At the two left hemisphere locations (T3 and C3), eight of the
nine significant points were clustered in each trace at approximately 100-200 
msec following the onset of the acoustic stimuli; that is, in the same time 
interval in which significant differences were obtained in the 490 msec sampling 
epochs (Figure 4). In contrast, the significant points at right hemisphere 
locations (T4 and C4) were apparently random with respect to stimuli in each 
trace. 

These data indicate that the evoked potential differences between Place and 
Pitch in Experiment 1 were not accompanied by differences between tasks during
the intertrial interval. Therefore, the evoked potential differences could not 
have been produced by an indirect influence of the CNV (Naatanen, 1967, 1969; 
Karlin, 1970; Donchin and Smith, 1970; Donald and Goff, 1971). 

Figure 16: Averaged activity during the intertrial intervals between
successive stimuli.

Signal Averaging Synchronized to Subjects' Button-Press Responses Instead of
Stimulus Onset



The signal averaging analysis in both the 490 msec and 6,144 msec sampling 
epochs was based on the implicit assumption that differences in neural activity
between Place and Pitch would be synchronized to the onset of the acoustic
stimuli. The fact that significant differences between dimensions were obtained by
synchronizing upon stimulus onset indicates that this assumption was valid at
least to some degree. However, it is possible that the differences between
Place and Pitch might be better synchronized to the "end" of perceptual process- 
ing rather than the beginning as implicitly assumed by the previous analyses. 
The present analysis investigated this possibility by synchronizing the signal 
averaging process to subjects' button-press responses. 

A second reason for this analysis was to investigate more directly the pos- 
sibility that the evoked potential differences between Place and Pitch might have 
been indirect effects of differences in the motor response between tasks. As
pointed out by Wood et al. (1971), both the actual button-press responses and RT
were nominally the same in the Place and Pitch tasks: the same buttons were 
equally distributed across stimuli in both tasks, and there were no significant 
differences in RT between tasks. In addition, Wood et al. (1971) showed that
partitioning the evoked potential data into blocks for fast versus slow RT did 
not produce evoked potential differences in the preresponse interval similar to 
those obtained when the data were partitioned into blocks for Place and Pitch. 

Despite such indirect evidence against an explanation of the evoked poten- 
tial results in motor response terms, it is still possible for such an explana- 
tion to be correct. For example, slight differences in the degree of pressure 
exerted for "identical" button-press responses in the two tasks might have been 
sufficient to produce the obtained differences in evoked potentials. Therefore, 
the present analysis investigated this possibility more directly, by determining 
whether neural activity synchronized to the button-press response was signifi- 
cantly different in the two tasks. 

Method. The same raw EEG from the control conditions for Place and Pitch 
used in previous analyses was played from the Honeywell tape recorder into the 
analog-to-digital converter of a LINC computer. Two electrode locations were 
analyzed, one over the left hemisphere (C3) and one over the right hemisphere 
(C4). The LINC was programmed to sample and average 1.28 sec before and after 
each trigger pulse at a sampling rate of 10 msec per point. The pulses used to 
trigger the LINC were those which had been generated simultaneously with subjects' 
button-press responses and recorded on a separate tape channel during each task. 
For each subject, this procedure resulted in an average of 128 trials in the con- 
trol conditions for Place and Pitch at each electrode location, corresponding to 
the evoked potential data from Experiment 1 shown in Figure 4. These data were 
then entered into the Wilcoxon statistical analysis to determine whether there 
were significant differences between the Place and Pitch tasks. 
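
The response-synchronized averaging described in this Method can be sketched in a few lines. This is only an illustration, not the original LINC program: the function name, the list-based data layout, and the rule of skipping incomplete epochs are assumptions. The window of 128 points on each side follows from 1.28 sec at 10 msec per point.

```python
def response_locked_average(eeg, triggers, half_window=128):
    """Average EEG epochs centered on button-press trigger samples.

    eeg         -- list of digitized samples, one per 10 msec (assumed layout)
    triggers    -- sample indices at which a button-press pulse occurred
    half_window -- points kept before and after each trigger
                   (128 points * 10 msec = 1.28 sec on each side)
    """
    epochs = []
    for t in triggers:
        start, stop = t - half_window, t + half_window
        if start < 0 or stop > len(eeg):
            continue  # assumption: drop epochs that run off the recording
        epochs.append(eeg[start:stop])
    if not epochs:
        raise ValueError("no complete epochs to average")
    n = len(epochs)
    # Point-by-point mean across epochs; index half_window is the response.
    return [sum(column) / n for column in zip(*epochs)]
```

The returned average has 256 points, with the button-press at the midpoint, which matches the centered time scale used for the response-synchronized figure.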

Results. The across-subject averages for these data are shown in Figure 17. 
Note that in contrast to previous figures, the point of synchronization in 
Figure 17 is the middle of the time scale (shown by the arrow), so that activity 
both before and after the button-press response is shown. Visual inspection of 
these data suggests that there were no differences between the Place and Pitch 
tasks. This conclusion was substantiated by the results of the Wilcoxon analysis, 
which showed that no more significant points than would be expected by chance 
alone occurred at either location. Two significant points were obtained at C3 
and zero significant points were obtained at C4. These results indicate that 
possible differences in activity produced by the button-press response could not 
have produced the significant differences between Place and Pitch in the 
preresponse interval in Experiment 1. Moreover, they indicate that the neural 
activity associated with phonetic processing was better synchronized to stimulus 
onset than to the button-press response. 

Figure 17: Averaged activity synchronized to subjects' button-press responses. 

GENERAL DISCUSSION 



Summary of Experiments 

The present investigation had three major goals: 1) to specify in greater 
detail the nature of the acoustic stimuli and processing tasks responsible for 
the RT and evoked potential differences between auditory and phonetic dimensions 
obtained by Day and Wood (1971a) and Wood et al. (1971); 2) to make a stronger 
test of the convergence of the RT and evoked potential measures upon the 
distinction between auditory and phonetic levels of processing; and 3) to obtain 
additional information about the neurophysiological characteristics of the neural 
activity associated with phonetic processing. The results of the four main 
experiments and the additional neurophysiological analyses will be summarized 
briefly in the context of these goals. 

Experiment 1 was a replication of the separate initial experiments for Place 
and Pitch in a single experiment. Both the RT and evoked potential results of 
Experiment 1 confirmed those obtained in the initial experiments. These results 
provide strong support for the conclusion that differences in both response 
measures reflect different levels of processing required for the Place and Pitch 
dimensions. In addition, the results of Experiment 1 lend further generality to 
those of the initial experiments since it employed two-formant instead of three- 
formant synthetic syllables, a different level on the Place dimension, and a 
different vowel context. 

Experiment 2 was a control experiment designed to insure that the differences 
between Place and Pitch in Experiment 1 were due to different levels of processing 
for the auditory and phonetic dimensions. If this were the case, then two 
auditory dimensions should not have produced the RT and evoked potential differences 
attributed to the phonetic level of processing. All conditions of Experiment 2 
were identical to those of Experiment 1, except that a second auditory dimension, 
Intensity, was substituted for the phonetic dimension Place. In contrast to the 
RT and evoked potential differences between Place and Pitch, there were no 
differences between Intensity and Pitch in either response measure. These results 
indicate that the differences between Place and Pitch were not artifacts of some 
aspect of the experimental design or measurement techniques, since these were 
identical in both experiments. 

Thus, Experiments 1 and 2 provide a firm empirical basis for the conclusion 
that the differences in RT and evoked potentials between Place and Pitch reflect 
different levels of processing involved in the identification of the two dimen- 
sions. Characteristics of the Place dimension which might have been responsible 
for the differential results of Experiments 1 and 2 were investigated in Experi- 
ments 3 and 4. 

Experiment 3 investigated the acoustic cue for the Place distinction, the 
F2 transition, in isolation rather than in phonetic context as in Experiment 1. 
The stimuli for Experiment 3 were those used in Experiment 1, but with all 
identical portions of the stimuli /bae/ and /gae/ eliminated. Therefore, all 
acoustic information that could have been used to distinguish between /bae/ and 
/gae/ in Experiment 1 was also present in the stimuli for Experiment 3, with 
only the redundant phonetic context removed. The RT and evoked potential data 
for Experiment 3 were identical to those for Pitch and Intensity in Experiment 2: 
there were no differences between Pitch and the F2 transitions in either response 
measure. These results suggest that it was not the processing of the F2 transi- 
tions per se that produced the differences between Place and Pitch, since the 
same F2 transitions had to be processed in Experiment 3. Rather, the important 
difference between Experiments 1 and 3 was that in Experiment 1 the F2 transi- 
tions occurred in phonetic context and cued a phonetic distinction, while in 
Experiment 3 they occurred in isolation and did not cue a phonetic distinction. 
Therefore, the phonetic level of processing appears to be specialized for the 
extraction of phonetic information and not for the processing of particular 
acoustic events in the speech signal. 

Experiment 4 compared Pitch with Pitch Contour, a dimension which is bas- 
ically auditory like Pitch and Intensity, but which is the cue for a linguistic 
distinction under some conditions. The RT and evoked potential results of this 
experiment were more comparable to those for auditory dimensions in Experiments 2 
and 3 than for Place in Experiment 1. Therefore, the mere fact that an acoustic 
parameter is known to be the cue for a linguistic distinction under certain con- 
ditions does not necessarily imply that the perception of this parameter requires 
processing in addition to the auditory level. 

The EEG from which the evoked potentials for Place and Pitch in Experiment 1 
were averaged was subjected to additional analyses in order to determine: 
1) whether the evoked potential differences were indirect effects of changes in 
other measures of neural activity; or 2) whether these were valid differences but 
were accompanied by parallel differences in other measures of neural activity. 
The results of all four additional analyses were consistent. There were no dif- 
ferences between the Place and Pitch tasks in background EEG (assessed by two 
independent techniques), no differences in averaged activity during the 
intertrial intervals, and no differences in activity synchronized to the subjects' 
button-press response. These results suggest that the differences in neural 
activity between auditory and phonetic dimensions are limited to the actual inter- 
val during which the phonetic processing occurs, and that this activity is better 
synchronized to the onset of a speech stimulus than to a motor response about 
that stimulus. 

Relative Discriminability of the Stimulus Dimensions in Each Experiment 

The principal difference between the RT data of Experiment 1 and Experiments 
2-4 was that in the latter there were mutual interference effects between Pitch 
and the second dimension in each experiment. In contrast, the interference be- 
tween Pitch and Place in Experiment 1 was unidirectional in that irrelevant vari- 
ation in Pitch significantly interfered with the identification of Place but not 
the reverse. This unidirectional interference effect has been attributed to the 
hypothesis that identification of Place involved an additional level of process- 
ing over that required for the identification of Pitch. However, an alternative 
factor should be considered: namely, the relative discriminability of the stimu- 
lus dimensions in each experiment. 

Imai and Garner (1965) showed that speed of card-sorting was highly dependent 
upon the discriminability of the stimulus dimension used to define the sorting 
categories. Discriminability was varied by varying the physical distance between 
the two levels on a given dimension, and the effects were measured in a 
two-category card-sorting task. Sorting times were found to decrease significantly 
with increases in discriminability, until an asymptotic minimum was reached. 
Further increases in discriminability beyond the value necessary to reach this 
asymptote would therefore have little effect on sorting speed. The same general 
relation between speed and discriminability obtained by Imai and Garner (1965) 
would be expected to occur in discrete RT tasks such as those used in the present 
investigation (cf. Biederman and Checkosky, 1970; Biederman and Kaplan, 1970; 
Well, 1971; Biederman, 1972). Therefore, it is important to ask whether the 
failure of Place to interfere with Pitch in Experiment 1 could have been due to 
differences in discriminability rather than subjects' ability to process Pitch 
selectively without processing Place as suggested above. 

First, it should be pointed out that the particular levels on the dimensions 
in all four experiments were selected a priori to be as equal in discriminability 
as possible. The absence of significant differences between the control RTs with- 
in the first three experiments, and the marked similarity of control RTs across 
the four experiments suggest that this attempt was successful. However, this 
evidence is not conclusive, since it would be possible for two dimensions to have 
equal control RTs and yet differ in discriminability. This could have happened 
in Experiment 1, for example, if the Pitch dimension were so highly discriminable 
that it fell on the asymptotic minimum of the hypothetical RT-discriminability 
function (Imai and Garner, 1965). In this case irrelevant variations in Place 
could have interfered with the processing of Pitch, but not enough to alter the 
RT for Pitch identification. 
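
The logic of this asymptote argument can be made concrete with a toy function. The exponential form and every parameter value below are illustrative assumptions, not the function reported by Imai and Garner (1965), and modeling interference as a fixed loss in effective discriminability is likewise only one hypothetical way to express the idea.

```python
import math

def toy_rt(discriminability, floor=350.0, gain=200.0, scale=1.0):
    """Hypothetical RT (msec) that falls toward an asymptotic floor
    as discriminability increases (arbitrary units)."""
    return floor + gain * math.exp(-discriminability / scale)

def rt_cost(discriminability, loss=0.5):
    """RT increase if interference lowered effective discriminability
    by `loss` units (an assumed way of modeling interference)."""
    return toy_rt(discriminability - loss) - toy_rt(discriminability)

# Near the asymptote the same loss costs almost nothing in RT;
# in the middle of the function it costs tens of msec.
cost_at_asymptote = rt_cost(10.0)
cost_mid_function = rt_cost(1.0)
```

Under these assumptions a dimension sitting on the flat part of the curve could suffer interference without any measurable change in RT, which is the possibility the following paragraphs argue against.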

Two kinds of evidence suggest that RT for Pitch was not at an asymptotic 
minimum in Experiment 1. 1) The same levels on the Pitch dimension used in 
Experiment 1 were also used in Experiments 2-4, in which there were large differ- 
ences in RT for Pitch between the control and orthogonal conditions. Since Pitch 
discriminability was identical in all four experiments, differences in their 
results cannot be attributed to differences in Pitch discriminability across 
experiments. 2) Direct evidence that Pitch was not at an asymptotic minimum was 
obtained in an experiment by Wood (1973), using the same acoustic stimuli and the 
same control and orthogonal conditions used in Experiment 1 above. The results from 
this part of the experiment were identical to those for Place and Pitch in 
Experiment 1, despite the fact that control RTs averaged 25 msec faster. In 
addition, a third condition was employed in which the levels on the Place and Pitch 
dimensions were completely correlated. In this correlated condition, RTs for 
both Place and Pitch were significantly faster than the control conditions by 
more than 40 msec. These results clearly indicate that RT for Pitch was not at 
an asymptotic minimum in Experiment 1, and therefore could not have produced the 
unidirectional interference between Place and Pitch. 

A second way that relative discriminability of Place and Pitch could have 
affected the results of Experiment 1 would be if Pitch were in the middle of the 
RT-discriminability function (as the data above indicate), but Place were far 
less discriminable. Again, the equal RTs for Place and Pitch provide suggestive, 
but not conclusive, evidence that Place and Pitch did not differ in discrimin- 
ability. In addition, the experiment of Day and Wood (1972b) showed that irrele- 
vant variations in Place produced substantial interference with identification of 
vowels, a dimension which was far more discriminable than Pitch in the present 
experiments. Therefore, neither relatively high discriminability of Pitch nor 
relatively low discriminability of Place could have been responsible for the uni- 
directional interference between these dimensions in Experiment 1. 

Individual Differences 

Previous speech perception experiments have obtained systematic individual 
differences in the degree to which subjects are perceptually "bound" by the con- 
straints of the language (Day, 1970). That is, some subjects appear to be able 
to "disengage" the linguistic processing of speech sounds and process the acous- 
tic aspects of the stimuli. Other subjects, however, have difficulty with such 
a task and their perception of speech stimuli appears to be governed to a much 
greater extent by linguistic constraints. In light of these data, it is 
important to consider the possible role of individual differences in the present 
experiments. 

The occurrence of individual differences may be evaluated directly in the 
analysis of variance for each experiment (Tables 2-5). In all four experiments, 
the main effect of subjects was highly significant, indicating that individual 
subjects differed considerably in their baseline RTs. In Experiment 1, for ex- 
ample, the main effect means for subjects ranged from 337 msec to 531 msec. In 
addition, the interactions between subjects and the other factors in each experi- 
ment were also significant. The Subject x Condition interaction was marginally 
significant at the P < .05 level in Experiments 1 and 2 and was clearly 
significant in Experiments 3 and 4, while the Subject x Dimension interaction was 
significant in all four experiments. Finally, the Subject x Condition x Dimension 
interaction was marginally significant at the P < .05 level in Experiment 1 and 
significant in Experiments 2-4. 

These significant interactions between subjects and the other factors in 
each experiment indicate that individual subjects differed both in the particular 
stimulus dimension identified faster in each experiment, and also in the absolute 
magnitude of the interference produced by each dimension. A similar pattern of 
individual differences was obtained by Garner and Felfoldy (1970) in card-sorting 
experiments with conditions analogous to the control and orthogonal conditions of 
the present experiments. In addition, Garner and Felfoldy (1970) found that 
amount of interference was not significantly correlated with the sorting speed 
for the single dimensions. A similar analysis was performed separately for each 
dimension in the four present experiments with identical results. There were no 
significant correlations between the RT for a given dimension in the control 
condition and the interference produced by that dimension when it was irrelevant 
in the orthogonal condition. Garner and Felfoldy (1970) noted that significant 
correlations between interference and base speed might have been expected if 
there were an identical relation between discriminability and processing speed 
for all subjects. The results of the present experiments support their 
conclusion that more complex speed-discriminability relations between subjects appear 
to be involved. 
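
The correlational check described above can be sketched as follows. The function and argument names are hypothetical and no data from the experiments are reproduced; the sketch only shows how per-subject interference and its correlation with control RT might be computed, assuming one mean RT per subject per condition.

```python
import math

def interference_from(irrelevant_ctrl_other, irrelevant_orth_other):
    """Per-subject interference produced by an irrelevant dimension:
    orthogonal RT minus control RT for the OTHER (judged) dimension."""
    return [o - c for c, o in zip(irrelevant_ctrl_other, irrelevant_orth_other)]

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

For a given dimension, `pearson_r` would be applied to the subjects' control RTs for that dimension and the list returned by `interference_from` for the other dimension. Testing the resulting coefficient for significance is a separate step not shown here.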

Finally, the degree to which subjects differed in their ability to 
"disengage" linguistic processing (Day, 1970) in the present experiments may be 
evaluated in the data for Place and Pitch in Experiment 1. It should be pointed out 
that the overall pattern of results shown in Figure 3 clearly indicates that 
over the entire experiment the subjects as a group were able to identify Pitch 
with minimal interference from irrelevant variation in Place. However, this 
pattern of results in the group means does not preclude the possibility of indiv- 
idual differences in this pattern among subjects. The appropriate term for this 
question in the analysis of variance (Table 2) is the Subject x Condition x 
Dimension interaction, which indicates whether the 2x2 partition of the data 
shown in Figure 3 differed significantly across subjects. This interaction was 
barely significant at the P < .05 level, indicating a slight tendency toward 
different amounts of interference for different subjects. However, like the 
significant Subject x Condition x Dimension interactions in the other three experiments 
(Tables 3-5), this interaction mainly reflects the sensitivity of the 
within-subject design to differences in the magnitude of interference for different 
subjects. Since there was much more interference in the orthogonal condition for 
Place than Pitch for all 12 subjects, the pattern of results shown in Figure 3 
is a valid summary of the data for any individual. The subjects in the present 
experiment were extremely well practiced (see Method section), and it is possible 
that more systematic individual differences might have been observable in earlier 
stages of practice. Preliminary experiments are under way to investigate this 
question further. 

Relation of the Present Experiments to the Distinction Between Integral and Non- 
integral Stimulus Dimensions 

Following previous distinctions by Torgerson (1958), Attneave (1962), 
Shepard (1964), Lockhead (1966), and Hyman and Well (1968), Garner and Felfoldy 
(1970) distinguished between "integral" and "nonintegral" stimulus dimensions: 
"...the distinction phenomenologically being between (nonintegral) dimensions 
which can be pulled apart, seen as unrelated, or analyzable, and those (integral 
dimensions) which cannot be analyzed but are somehow perceived as single 
dimensions" (p. 325). The integral-nonintegral distinction resolves a number of 
apparently conflicting findings obtained when the particular stimulus dimensions 
employed in a given experiment are not taken into account (Garner, 1970). 

In operational terms, integral dimensions "...produce a redundancy gain 
when the dimensions are correlated and some measure of speed or accuracy of dis- 
crimination is used, and produce interference in speed of classification when 
selective attention is required with orthogonal stimulus dimensions" (Garner and 
Felfoldy, 1970;3.?8). In contrast, nonintegral dimensions produce no redundancy 
gain when correlated and no interference when orthogonal. The control and 
orthogonal conditions of the present experiments were directly analogous to two 
of the three conditions used by Garner and Felfoldy (1970) to distinguish 
empirically between integral and nonintegral dimensions. Therefore, it is of interest 
to compare the results for the dimensions in the present experiments with those 
typical of integral and nonintegral dimensions in the experiment of Garner and 
Felfoldy (1970). 
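
The operational comparison can be illustrated with a small sketch that labels the interference pattern for a pair of dimensions from their control and orthogonal RTs. The function name, the dictionary layout, and especially the fixed `criterion` in msec are assumptions; the criterion merely stands in for the significance tests actually used, and the numbers in the usage lines are invented for illustration.

```python
def interference_pattern(control, orthogonal, criterion=10.0):
    """Label the interference pattern for a pair of dimensions.

    control, orthogonal -- dicts mapping dimension name to mean RT (msec)
    criterion           -- RT increase treated as reliable interference
                           (a stand-in for a proper significance test)
    """
    if set(control) != set(orthogonal) or len(control) != 2:
        raise ValueError("expected the same two dimensions in both dicts")
    # A dimension "suffers" if its RT rises when the other one varies.
    suffers = {d: (orthogonal[d] - control[d]) > criterion for d in control}
    count = sum(suffers.values())
    if count == 2:
        return "mutual"          # the pattern expected of integral dimensions
    if count == 0:
        return "none"            # the pattern expected of nonintegral dimensions
    return "unidirectional"      # only one dimension is interfered with

# Hypothetical values, for illustration only:
example = interference_pattern(
    {"Place": 400.0, "Pitch": 400.0},
    {"Place": 450.0, "Pitch": 405.0},
)
```

A full classification in the sense of Garner and Felfoldy (1970) would also require the correlated condition, since redundancy gain is part of their operational definition; only the two conditions shared with the present experiments are covered here.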

In the discrete RT tasks of the present experiments, integral dimensions 
would be expected to produce increased RT in the orthogonal conditions relative 
to the control conditions for both dimensions. This pattern of results is 
exactly what was obtained in Experiments 2-4. The correspondence between the 
results of Experiments 2-4 in the auditory modality and the integral visual 
dimensions studied by Garner and Felfoldy (1970) suggests that the concept of stimulus 
integrality is not limited to the visual modality but is a more general 
characteristic of human information processing. The fact that two physical dimensions 
can be manipulated independently in the environment does not guarantee that they 
are perceptually independent (cf. Garner and Morton, 1969). 

In contrast to the similarity between Experiments 2-4 and results typical of 
integral dimensions, the results for Place and Pitch in Experiment 1 were clearly 
not consistent with either the integral or nonintegral dimensions of Garner and 
Felfoldy (1970). Day and Wood (1972a) referred to the pattern of results for 
Place and Pitch as unidirectional, as opposed to mutual, interference between di- 
mensions. These results appear to be the only case to date of unidirectional 
interference in speeded classification tasks of this type. However, previous ex- 
periments using different perceptual tasks have obtained results which may be re- 
lated. 

For example, Egeth and Pachella (1969, Experiment 4) required subjects to 
make absolute judgments of three dimensions (color, size, and eccentricity) of 
ellipses. Their results showed clear differences in the interference produced by 
color and the other two dimensions, just as Place and Pitch differed in this 
respect in the present investigation. In a condition in which all three dimensions 
varied orthogonally, judgment of color was unimpaired relative to a condition in 
which color varied alone. In contrast, judgments of both size and eccentricity 
were significantly impaired in the orthogonal condition relative to the corre- 
sponding single dimension conditions. By implying a directional dependence be- 
tween dimensions, this lack of mutual interference between color and the other 
dimensions in an absolute judgment task is formally equivalent to the lack of 
mutual interference between dimensions in a speeded classification task. 

Another result which may be related to the unidirectional interference be- 
tween Place and Pitch is based on the "physical-match" versus "name-match" para- 
digm of Posner and Mitchell (1967; Posner, 1969). In these experiments, subjects 
were presented pairs of letters and had to indicate whether they were "same" or 
"different" as rapidly as possible on each trial. On some trials the same- 
different decision was to be based on the physical identity of the letters 
(e.g., AA), while on other trials it was to be based on their name identity 
(e.g., Aa), whether they were the same or different physically. Subjects took 
significantly longer to decide that the same stimulus pair differed in name than 
physical identity, leading Posner and Mitchell (1967) to conclude that an addi- 
tional level of processing was required for the name match over that required for 
the physical match. 

The data of Egeth and Pachella (1969) and Posner and Mitchell (1967), and the 
unidirectional interference between Place and Pitch (Day and Wood, 1972a, Experi- 
ment 1) indicate that the integral-nonintegral distinction alone is not sufficient 
to account for situations in which different levels of processing may be involved 
in the perception of different stimulus attributes. In addition to consideration 
of the particular dimensions in a given experiment as advocated by the integral- 
nonintegral distinction, consideration must also be given to the particular role 
in the experiment played by each dimension. As long as the results for both 
members of a pair of dimensions are equivalent, as in the experiments of Garner 
and Felfoldy (1970), there is no need to distinguish between individual members 
of the pair. However, when different dimensions of the same physical stimuli 
produce reliably different results, such a distinction is clearly required. 

Converging Operations and the Concept of Specialized Neural Mechanisms for 
Speech Perception 

Evidence from a number of sources converges upon the distinction between 
auditory and phonetic levels of processing, and upon the idea that the phonetic 
level involves specialized neural mechanisms which are lateralized in one cere- 
bral hemisphere: 

1) the right-ear advantage in dichotic listening (Studdert-Kennedy 
and Shankweiler, 1970; Darwin, 1971a; Haggard, 1971; and refer- 
ences cited therein) ; 

2) the tendency for phoneme discrimination to be limited by linguis- 
tic categories rather than physical differences in the stimuli 
(i.e., "categorical perception," Liberman et al., 1967; Studdert- 
Kennedy et al., 1970a; Pisoni, 1971); 

3) the difference in temporal-order judgment accuracy for dichoti- 
cally presented linguistic and nonlinguistic dimensions (Day and 
Bartlett, 1971; and references cited therein); 

4) the unidirectional interference between auditory and phonetic 
dimensions of the same speech stimuli in speeded-classification 
tasks (Day and Wood, 1972a; RT data of Experiment 1); and 

5) the differences in neural activity over the left hemisphere dur- 
ing the processing of auditory and phonetic dimensions of the 
same speech stimuli (Wood et al., 1971; evoked potential data of 
Experiment 1). 

The "lag effect" for the identification of temporally offset dichotic 
syllables (Studdert-Kennedy et al., 1970b) has been proposed by some authors as 
another result reflecting the operation of the phonetic level (cf. Liberman et al., 
1971; Kirstein, 1971). However, while the "lag effect" is a consistent finding 
for speech stimuli (Berlin et al., 1970; Lowe, Cullen, Thompson, Berlin, 
Kirkpatrick, and Ryan, 1970; Studdert-Kennedy et al., 1970b; Darwin, 1971b), it 
has been obtained with nonspeech stimuli as well (Darwin, 1971b). 

Recently, different investigators have used the strategy of varying the 
stimuli and tasks in these paradigms to determine the particular characteristics 
which distinguish the phonetic level of processing from the general auditory 
system. Although precisely the same experiments have not been conducted in all 
four paradigms, the pattern of results across paradigms forms a consistent, if 
incomplete, characterization of the phonetic level of processing. 

The first important observation about the nature of the phonetic level is 
that not all linguistic dimensions require specialized phonetic processing for 
their perception. Using both natural and synthetic speech, a number of experi- 
ments have obtained results with certain linguistic dimensions which are more 
typical of those obtained with nonspeech. For example, while stop consonants 
have consistently produced evidence of phonetic level processing, steady-state 
vowels have not. Vowels have not produced a right-ear advantage in dichotic 
listening (Studdert-Kennedy and Shankweiler, 1970; Darwin, 1969, 1971a); they 
have not resulted in the phoneme boundary effect of categorical perception 
(Liberman, Harris, Hoffman, and Griffith, 1957; Fry, Abramson, Eimas, and 
Liberman, 1962; Stevens, Liberman, Studdert-Kennedy, and Ohman, 1969; Pisoni, 
1971); and they have not produced the unidirectional interference with the 
auditory dimension Pitch in speeded-classification tasks (Day and Wood, in 
preparation). Evoked potential data for vowels corresponding to the experiments of 
Wood et al. (1971) and Experiment 1 for stop consonants are not available at 
present. It is interesting to note in this context that consonants and vowels 
also differ in characteristics of short-term memory (Crowder, 1971; Pisoni, 1971; 
Cole, 1973). 

The differences between stop consonants and vowels have led to the sugges- 
tion that the phonetic level may be related to the nature of the acoustic cues 
which underlie the perception of these stimuli (Liberman et al., 1967). The cues 
for the stop consonants are highly "encoded" in the sense that information about 
both the consonant and adjacent vowel is transmitted simultaneously in the speech 
signal. Therefore, large variations in the acoustic cues for stop consonants 
occur commonly as a function of phonetic environment (Liberman et al., 1967). 
Thus, the mapping from sound to phoneme for stop consonants is both one-to-many 
and many-to-one. In contrast, the acoustic cues for vowels are much less depen- 
dent upon their phonetic context, and as a result the phonetic categories for 
vowels are more directly related to the acoustic cues. Based on these observa- 
tions, Liberman et al. (1967) and Studdert-Kennedy and Shankweiler (1970), among 
others, have suggested that a crucial determinant for the involvement of the 
phonetic level may be the "encodedness" of the phonemes in a given experiment. 
Support for this hypothesis comes from a number of different sources in addition 
to the basic difference between the highly "encoded" stop consonants and the 
"unencoded" vowels. 

Right-ear advantages can be obtained for vowels under conditions which pre- 
sumably increase their "encodedness" by decreasing the one-to-one relationship 
between the vowel categories and their acoustic cues. Darwin (1971a) obtained a 
significant right-ear advantage for vowels when the stimuli varied in vocal tract 
size, but no ear advantage when the stimuli were from a single vocal tract. 
Similar results were obtained by Haggard (1971) with vowels that varied in 
fundamental frequency as well as vocal tract size. Spellacy and Blumstein (1970) 
found a significant right-ear advantage for dichotic vowels when they were em- 
bedded in a sequence which included dichotically contrasting stop consonants, but 
a left-ear advantage for the same vowel stimuli when embedded in a series of 
dichotic nonspeech stimuli. Thus, vowels can yield a right-ear advantage similar 
to that obtained for stop consonants, but only under conditions which appear to 
require more complex "decoding" than is normally required for isolated steady- 
state vowels. 

Day and Vigorito (1973) and Cutting (1973) assumed an "encodedness continuum" 
in which "stop consonants appear to be the most highly encoded speech sounds, 
vowels the least encoded, with other sounds in the middle" (Day and Vigorito, 
1973). In both temporal-order judgment (Day and Vigorito, 1973) and 
ear-monitoring tasks (Cutting, 1973), the magnitude of the right-ear advantage paralleled 
the degree of "encodedness" for stops, liquids, and vowels: stops produced large 
right-ear advantages, liquids produced reduced right-ear advantages, and vowels 
produced either no ear advantage or one in favor of the left. Darwin (1971a) 
found that identification of dichotic fricatives produced a right-ear advantage 
when cued by friction noise plus formant transitions, but did not produce a 
right-ear advantage when cued by the friction noise alone. Friction noises are 
relatively "unencoded" since they show little variation as a function of phonetic 
context, while the formant transitions for fricatives are "encoded" just as they 
are for stop consonants. Therefore, the experiment of Darwin (1971a) demon- 
strates that the use of the same phonemic response categories can produce differ- 
ent ear advantages, depending upon the "encodedness" of the acoustic cues which 
underlie the phonemic distinction. 

All of the evidence supporting the "encodedness" hypothesis described above 
comes from the dichotic listening paradigm. Except for the basic difference be- 
tween stop consonants and vowels, variations in "encodedness" have not been 
widely investigated in other paradigms. However, Experiment 4 of the present 
investigation provides suggestive evidence in favor of the "encodedness" hypo- 
thesis from the speeded-classification and evoked potential paradigms. The di- 
mension compared with Pitch in this experiment was Pitch Contour, the major cue 
for judging whether an utterance is a statement or question (Lieberman, 1967, in 
press; Fry, 1968; Lehiste, 1970; Studdert-Kennedy and Hadding, 1971). In con- 
trast to the highly "encoded" phonetic cues, the linguistic categories of ques- 
tion and statement correspond much more directly to particular values of the 
acoustic cue. Both the RT and evoked potential data for Pitch Contour indicated 
that the perception of Pitch Contour did not require processing in addition to 
the auditory level: the results were much more similar to those for the non- 
linguistic dimension Intensity, than for the phonetic dimension Place. The data 
of Studdert-Kennedy and Hadding (1971; Hadding-Koch and Studdert-Kennedy, 1964) 
are also consistent with the suggestion that Pitch Contour does not require pro- 
cessing in addition to the auditory level. With only minor exceptions, judgments 
of Pitch Contour were the same when subjects were instructed to judge them on 
auditory grounds (falling versus rising) and on linguistic grounds (statement 
versus question). Moreover, Studdert-Kennedy and Hadding (1971) found that Pitch 
Contour was judged similarly when the Pitch changes were carried by a speech 
stimulus and by a pure tone. 

The experiments described in the previous paragraphs suggest that the pho- 
netic level of processing may be related to the specialized "decoding" process 
required for speech perception. This conclusion has been reached by a number of 
other investigators, and is most clearly summarized by Studdert-Kennedy and 
Shankweiler (1970): "...specialization of the dominant hemisphere in speech per- 
ception is due to its possession of a linguistic device, not to specialized 
capacities for auditory analysis... [W]hile the general auditory system common 
to both hemispheres is equipped to extract the auditory parameters of a speech 
signal, the dominant hemisphere may be specialized for the extraction of linguis- 
tic features from those parameters" (p. 579). 

However, there is an alternative interpretation for most of the experiments 
cited above, which is equally consistent with the data. With few exceptions, 
the experiments which required processing of highly "encoded" cues also involved 
the processing of formant transitions. Therefore, it is possible that the re- 
sults which have been attributed to a phonetic or linguistic decoding process 
were actually due to the processing of particular acoustic features of the stim- 
uli, such as the formant transitions. The possibility that the phonetic level may be 
specialized for processes which are basically auditory rather than linguistic has 
been noted in passing by other authors (cf. Liberman, 1970; Studdert-Kennedy and 
Shankweiler, 1970), and has been explicitly suggested as a possible basis for 
speech perception by Abbs and Sussman (1971). Only recently has direct evidence 
regarding this possibility become available. 

In a categorical perception experiment, Mattingly et al. (1971) compared 
the perception of the same F2 transitions under two conditions: 1) when they 
occurred in phonetic context and cued the phonetic distinction among /bae/, /dae/, 
and /gae/; and 2) when they occurred in isolation and sounded like nonspeech 
"chirps." If the phonetic level were specialized for the processing of formant 
transitions, identical results would be expected in both conditions. When the F2 
transitions occurred in phonetic context, results typical of categorical percep- 
tion were obtained: high discrimination peaks near the phoneme boundaries with 
performance near chance within the phonemic categories. In contrast, the percep- 
tion of the same F2 transitions was clearly not categorical when they occurred in 
isolation. 

A similar pattern of results for isolated F2 transitions was obtained in the 
speeded-classification and evoked potential paradigms in Experiment 3 of the pre- 
sent investigation. In both sets of data the perception of the isolated F2 transi- 
tions more closely resembled the nonspeech dimension Intensity than the same F2 
transitions when they cued a phonetic distinction (Experiment 1). Taken together, 
the results of Mattingly et al. (1971) and Experiment 3 provide strong evidence 
that the phonetic level of processing is specialized for the extraction of 
abstract phonetic features, not for the detection of particular acoustic features 
which occur in speech. While a process formally resembling "feature detection" 
may be involved in speech perception (cf. Eimas and Corbit, 1973), the evidence 
presented above clearly indicates that the features must be specified in linguis- 
tic rather than auditory or acoustic terms (Mattingly et al., 1971; Studdert- 
Kennedy et al., 1972; Eimas and Corbit, 1973; Experiment 3). 

Serial Versus Parallel Organization of Auditory and Phonetic Levels [Abridged] 

The experiments described in the previous section firmly establish the dis- 
tinction between auditory and phonetic levels of processing in speech perception, 
and provide additional information about the specialized decoding process per- 
formed by the phonetic level. However, these experiments provide relatively 
little information concerning the general organization of these two levels or the 
nature of the interaction between them. Two process models appear to be consis- 
tent with existing data: 1) a serial or sequential model, in which the auditory 
level would occur first in sequence followed by the phonetic level; and 2) a 
parallel model in which at least some portion of auditory and phonetic process- 
ing could proceed simultaneously. 

Direct evidence regarding serial versus parallel organization of auditory 
and phonetic levels was obtained in a recent experiment by Wood (1973).* This 
experiment employed control and orthogonal conditions identical to those of the 



*Editor's Note: In the original body of the thesis, the remainder of this section 
presented a detailed discussion of the rationale, results, and interpretation of 
the Wood (1973) experiment. This section has been abridged here since a complete 
version of the Wood (1973) experiment appears elsewhere in this issue of the 
Haskins Laboratories Status Report on Speech Research (Parallel Processing of 
Auditory and Phonetic Information in Speech Perception). 
present experiments, and used the same CV syllables varying in Place and Pitch 
shown in Figures 1 and 2 (Experiment 1). The possibility of parallel processing 
of Place and Pitch was evaluated in a correlated condition in which the two di- 
mensions varied in a completely redundant manner. The RTs in the correlated con- 
dition for both Place and Pitch were significantly faster than the corresponding 
single dimension control conditions. This "redundancy gain" was not attributable 
to speed-accuracy trades, to selective serial processing (Garner, 1969; Morton, 
1969; Felfoldy and Garner, 1971), or to differential transfer between conditions. 
These results are consistent only with a model in which auditory and phonetic in- 
formation can be processed in parallel (cf. Biederman and Checkosky, 1970; 
Lockhead, 1972). 
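
The logic of the redundancy gain can be illustrated with a small simulation. 
The sketch below is not the analysis reported by Wood (1973); it assumes one 
simple form of parallel model, an independent "race" in which the response in 
the correlated condition is triggered by whichever dimension finishes first. 
The finishing-time distributions (normal, with the means and standard deviation 
shown) and the function name simulate_rts are hypothetical values chosen only 
for illustration. 

```python
import random
import statistics


def simulate_rts(n_trials=20000, seed=1):
    """Simulate mean RTs (ms) for two control conditions and a correlated
    condition under an assumed independent parallel race model."""
    rng = random.Random(seed)
    place, pitch, correlated = [], [], []
    for _ in range(n_trials):
        # Hypothetical finishing times for each channel on a single trial.
        t_place = rng.gauss(420.0, 40.0)  # phonetic dimension (Place)
        t_pitch = rng.gauss(400.0, 40.0)  # auditory dimension (Pitch)
        place.append(t_place)
        pitch.append(t_pitch)
        # Correlated condition: either dimension specifies the response,
        # so the faster of the two parallel channels determines the RT.
        correlated.append(min(t_place, t_pitch))
    return (
        statistics.mean(place),
        statistics.mean(pitch),
        statistics.mean(correlated),
    )


mean_place, mean_pitch, mean_correlated = simulate_rts()
print(f"Place control: {mean_place:.1f} ms")
print(f"Pitch control: {mean_pitch:.1f} ms")
print(f"Correlated:    {mean_correlated:.1f} ms")
```

Under these assumptions the simulated correlated-condition mean falls below 
both control means. A subject who relied on a single dimension throughout the 
correlated condition could at best match the faster control condition, which 
is why a gain relative to both controls is informative. 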

However, the conclusion that auditory and phonetic processing can occur in 
parallel contradicts the widely held idea that linguistic processing is dependent 
upon preliminary processes performed by the general auditory system (cf. Stevens 
and House, 1970; Studdert-Kennedy and Shankweiler, 1970). As noted by Wood 
(1973), this idea must be correct at least to some degree, since to be perceived 
all acoustic signals must first be transduced by the receptor apparatus. To 
account for these observations, Wood (1973) presented a hypothetical organization 
of auditory and phonetic levels consisting of three components: 1) a common 
peripheral component for the transduction and preliminary analysis of all acoustic 
signals; 2) a "central" auditory component for the additional processing of non- 
linguistic auditory information; and 3) a "central" phonetic component for the 
extraction of abstract phonetic features from the results of the preliminary 
auditory analysis. The second two components would be capable of functioning in 
parallel, but both would be dependent upon the output of the prior peripheral 
processing. According to this hypothesis, the term "auditory level" as used by 
Studdert-Kennedy and as used in the initial portions of the present paper would 
actually consist of two parts, the first occurring before any phonetic processing 
is begun, and the second occurring simultaneously with phonetic processing. As a 
working hypothesis, this "hybrid" organization appears to be consistent with the 
evidence that distinguishes between auditory and phonetic levels of processing, 
and with the demonstration that processing of auditory and phonetic information 
can occur in parallel. 
SUMMARY AND CONCLUSIONS 



The RT and evoked potential data of the present experiments provide two new 
sources of evidence for a distinction between auditory and phonetic levels of 
processing in speech perception, and provide additional insight into the nature 
of the specialized processes performed by the phonetic level. Experiments 1 and 
2 verified the conclusions of Day and Wood (1972a) and Wood et al. (1971) that 
the differences in RT and evoked potentials between Place and Pitch were the re- 
sult of different levels of processing required for auditory and phonetic dimen- 
sions. Both sets of data suggest that identification of the phonetic dimension 
involved additional processing mechanisms which were not required for identifica- 
tion of an auditory dimension of the same physical stimuli. Experiment 3 showed 
that the phonetic level is specialized for the extraction of abstract phonetic 
features, not for the detection of particular acoustic features in the speech 
signal. However, while the processes performed by the phonetic level are bas- 
ically linguistic in nature, Experiment 4 showed that the phonetic level is not 
required for the processing of all acoustic dimensions that can carry linguistic 
information. The additional neurophysiological analyses demonstrated that the 
differences in neural activity between auditory and phonetic dimensions occurred 
only during the actual processing of the two dimensions, and were not accompanied 
by more generalized differences in neural activity. Taken together, these experi- 
ments provide a strong set of converging operations upon the distinction between 
auditory and phonetic levels of processing in speech perception, and upon the 
idea that the phonetic level involves specialized linguistic mechanisms which are 
lateralized in one cerebral hemisphere. 